Abstract

Let $M$ be an $n\alpha \times n$ matrix of rank $r \ll n$, and assume that a uniformly random subset $E$ of
its entries is observed. We describe an efficient algorithm that reconstructs $M$ from $|E| = O(rn)$
observed entries with relative root mean square error

$$\mathrm{RMSE} \le C(\alpha) \left( \frac{nr}{|E|} \right)^{1/2} .$$

Further, if $r = O(1)$ and $M$ is sufficiently unstructured, then it can be reconstructed exactly from
$|E| = O(n \log n)$ entries.

This settles (in the case of bounded rank) a question left open by Candes and Recht and 
improves over the guarantees for their reconstruction algorithm. The complexity of our algorithm 
is $O(|E|\, r \log n)$, which opens the way to its use for massive data sets. In the process of proving
these statements, we obtain a generalization of a celebrated result by Friedman-Kahn-Szemeredi 
and Feige-Ofek on the spectrum of sparse random matrices. 

1 Introduction 

Imagine that each of m customers watches and rates a subset of the n movies available through a 
movie rental service. This yields a dataset of customer-movie pairs $(i,j) \in E \subseteq [m] \times [n]$ and, for
each such pair, a rating $M_{ij} \in \mathbb{R}$. The objective of collaborative filtering is to predict the rating for
the missing pairs in such a way as to provide targeted suggestions.¹ The general question we address
here is: Under which conditions do the known ratings provide sufficient information to infer the 
unknown ones? Can this inference problem be solved efficiently? The second question is particularly 
important in view of the massive size of actual data sets. 



1.1 Model definition 

A simple mathematical model for such data assumes that the (unknown) matrix of ratings has rank 
$r \ll m, n$. More precisely, we denote by $M$ the matrix whose entry $(i,j) \in [m] \times [n]$ corresponds
to the rating user $i$ would assign to movie $j$. We assume that there exist matrices $U$, of dimensions
$m \times r$, and $V$, of dimensions $n \times r$, and a diagonal matrix $\Sigma$, of dimensions $r \times r$, such that

$$M = U \Sigma V^T . \tag{1}$$



'Department of Electrical Engineering, Stanford University 
^Departments of Statistics, Stanford University 

¹Indeed, in 2006, Netflix made public such a dataset with $m \approx 5 \cdot 10^5$, $n \approx 2 \cdot 10^4$ and $|E| \approx 10^8$, and challenged
the research community to predict the missing ratings with root mean square error below 0.8563 [Net].
For justification of these assumptions and background on the use of low rank matrices in information
retrieval, we refer to [BDJ99]. Since we are interested in very large data sets, we shall focus on the
limit $m, n \to \infty$ with $m/n = \alpha$ bounded away from $0$ and $\infty$.

We further assume that the factors $U$, $V$ are unstructured. This notion is formalized by the
incoherence condition introduced by Candes and Recht [CR08], and defined in Section 2. In particular,
the incoherence condition is satisfied with high probability if $M = U \Sigma V^T$ with $U$ and $V$ uniformly
random matrices with $U^T U = m\mathbf{1}$ and $V^T V = n\mathbf{1}$. Alternatively, incoherence holds if the entries of
$U$ and $V$ are i.i.d. bounded random variables.

Out of the $m \times n$ entries of $M$, a subset $E \subseteq [m] \times [n]$ (the user/movie pairs for which a rating
is available) is revealed. We let $M^E$ be the $m \times n$ matrix that contains the revealed entries of $M$,
and is filled with 0's in the other positions:

$$M^E_{ij} = \begin{cases} M_{ij} & \text{if } (i,j) \in E , \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

The set $E$ will be uniformly random given its size $|E|$.
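The model above can be simulated directly. The following is a minimal sketch, assuming NumPy; the function names `random_low_rank` and `reveal_entries` are illustrative and do not come from the paper. It draws a rank-$r$ matrix with the normalization $U^T U = m\mathbf{1}$, $V^T V = n\mathbf{1}$ and reveals a uniformly random subset of entries of a given size.

```python
import numpy as np

def random_low_rank(m, n, r, sigma, rng):
    """Return M = U diag(sigma) V^T with U^T U = m*I and V^T V = n*I."""
    # Orthonormalize Gaussian matrices, then rescale to the paper's normalization.
    U = np.linalg.qr(rng.standard_normal((m, r)))[0] * np.sqrt(m)
    V = np.linalg.qr(rng.standard_normal((n, r)))[0] * np.sqrt(n)
    return U @ np.diag(sigma) @ V.T

def reveal_entries(M, num_revealed, rng):
    """Pick E uniformly among subsets of the given size; return (mask, M_E)."""
    m, n = M.shape
    flat = rng.choice(m * n, size=num_revealed, replace=False)
    mask = np.zeros(m * n, dtype=bool)
    mask[flat] = True
    mask = mask.reshape(m, n)
    # M_E keeps the revealed entries and is zero elsewhere, as in Eq. (2).
    return mask, np.where(mask, M, 0.0)

rng = np.random.default_rng(0)
M = random_low_rank(60, 40, 3, [1.0, 1.1, 1.2], rng)
mask, M_E = reveal_entries(M, 600, rng)
```

The boolean `mask` plays the role of the set $E$; keeping it alongside `M_E` avoids confusing a revealed entry that happens to be zero with an unrevealed one.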



1.2 Algorithm 

A naive algorithm consists of the following projection operation. 

Projection. Compute the singular value decomposition (SVD) of $M^E$ (with $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$)

$$M^E = \sum_{i=1}^{\min(m,n)} \sigma_i x_i y_i^T , \tag{3}$$

and return the matrix $\mathsf{T}_r(M^E) = (mn/|E|) \sum_{i=1}^{r} \sigma_i x_i y_i^T$ obtained by setting to $0$ all but the $r$ largest
singular values. Notice that, apart from the rescaling factor $(mn/|E|)$, $\mathsf{T}_r(M^E)$ is the orthogonal
projection of $M^E$ onto the set of rank-$r$ matrices. The rescaling factor compensates for the smaller
average size of the entries of $M^E$ with respect to $M$.

It turns out that, if $|E| = \Theta(n)$, this algorithm performs very poorly. The reason is that the
matrix $M^E$ contains columns and rows with $\Theta(\log n / \log\log n)$ non-zero (revealed) entries. The
largest singular values of $M^E$ are of order $\Theta(\sqrt{\log n / \log\log n})$. The corresponding singular vectors
are highly concentrated on high-weight column or row indices (respectively, for left and right singular
vectors). Such singular vectors are an artifact of the high-weight columns/rows and do not provide
useful information about the hidden entries of $M$. This motivates the definition of the following
operation (hereafter the degree of a column or of a row is the number of its revealed entries).

Trimming. Set to zero all columns in $M^E$ with degree larger than $2|E|/n$. Set to $0$ all rows with
degree larger than $2|E|/m$.

Figure 1 shows the singular value distributions of $M^E$ and $\widetilde{M}^E$ for a random rank-3 matrix $M$.
The surprise is that trimming (which amounts to 'throwing out information') makes the underlying 
rank-3 structure much more apparent. This effect becomes even more important when the number 
of revealed entries per row/column follows a heavy tail distribution, as for real data. 

In terms of the above routines, our algorithm has the following structure. 

Spectral Matrix Completion( matrix $M^E$ )
1: Trim $M^E$, and let $\widetilde{M}^E$ be the output;
2: Project $\widetilde{M}^E$ to $\mathsf{T}_r(\widetilde{M}^E)$;
3: Clean residual errors by minimizing the discrepancy $F(X,Y)$.
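Steps 1 and 2 of the pseudocode can be sketched as follows. This is an illustrative sketch assuming NumPy and a dense SVD, not the authors' implementation; the names `trim` and `project_rank_r` are hypothetical, and degrees are computed once from the original set $E$ (the text does not say whether row degrees are recomputed after column trimming).

```python
import numpy as np

def trim(M_E, mask):
    """Zero out columns with degree > 2|E|/n and rows with degree > 2|E|/m."""
    m, n = M_E.shape
    num_revealed = mask.sum()
    col_deg = mask.sum(axis=0)   # revealed entries per column
    row_deg = mask.sum(axis=1)   # revealed entries per row
    trimmed = M_E.copy()
    trimmed[:, col_deg > 2 * num_revealed / n] = 0.0
    trimmed[row_deg > 2 * num_revealed / m, :] = 0.0
    return trimmed

def project_rank_r(M_trimmed, num_revealed, r):
    """Keep the r largest singular values and rescale by mn/|E|."""
    m, n = M_trimmed.shape
    x, s, yt = np.linalg.svd(M_trimmed, full_matrices=False)
    return (m * n / num_revealed) * (x[:, :r] * s[:r]) @ yt[:r, :]
```

For large sparse inputs a dense SVD is wasteful; an iterative method that only touches the revealed entries would be the natural replacement.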
Figure 1: Histogram of the singular values of a partially revealed matrix before trimming (left) and after
trimming (right) for a $10^4 \times 10^4$ random rank-3 matrix $M$ with $\epsilon = 30$ and $\Sigma = \mathrm{diag}(1, 1.1, 1.2)$. After trimming
the underlying rank-3 structure becomes clear. Here the number of revealed entries per row follows a heavy
tail distribution with $\mathbb{P}\{N = k\} = \mathrm{const.}/k^3$.



The last step of the above algorithm makes it possible to reduce (or eliminate) small discrepancies between
$\mathsf{T}_r(\widetilde{M}^E)$ and $M$, and is described below.

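Cleaning. Various implementations are possible, but we found the following one particularly appealing.
Given $X \in \mathbb{R}^{m \times r}$, $Y \in \mathbb{R}^{n \times r}$ with $X^T X = m\mathbf{1}$ and $Y^T Y = n\mathbf{1}$, we define

$$F(X,Y) \equiv \min_{S \in \mathbb{R}^{r \times r}} \mathcal{F}(X,Y,S) , \tag{4}$$

$$\mathcal{F}(X,Y,S) \equiv \frac{1}{2} \sum_{(i,j) \in E} \left( M_{ij} - (X S Y^T)_{ij} \right)^2 . \tag{5}$$

The cleaning step consists in writing $\mathsf{T}_r(\widetilde{M}^E) = X_0 S_0 Y_0^T$ and minimizing $F(X,Y)$ locally with initial
condition $X = X_0$, $Y = Y_0$.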

Notice that $F(X,Y)$ is easy to evaluate since it is defined by minimizing the quadratic function
$S \mapsto \mathcal{F}(X,Y,S)$ over the low-dimensional matrix $S$. Further, it depends on $X$ and $Y$ only through
their column spaces. In geometric terms, $F$ is a function defined over the cartesian product of
two Grassmann manifolds (we refer to Section 6 for background and references). Optimization
over Grassmann manifolds is a well understood topic [EAS99] and efficient algorithms (in particular
Newton and conjugate gradient) can be applied. To be definite, we assume that gradient descent
with line search is used to minimize $F(X,Y)$.
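Because $S \mapsto \mathcal{F}(X,Y,S)$ is quadratic, evaluating $F(X,Y)$ reduces to a linear least squares problem in the $r^2$ entries of $S$. The sketch below assumes NumPy; `optimal_S` and `discrepancy` are illustrative names, and a boolean `mask` stands for the set $E$.

```python
import numpy as np

def optimal_S(X, Y, M_E, mask):
    """Solve min_S sum_{(i,j) in E} (M_ij - (X S Y^T)_ij)^2 by least squares."""
    rows, cols = np.nonzero(mask)
    r = X.shape[1]
    # Row k of A is the flattened outer product of X[i_k] and Y[j_k],
    # so that A @ S.reshape(-1) equals (X S Y^T) at the revealed positions.
    A = (X[rows][:, :, None] * Y[cols][:, None, :]).reshape(len(rows), r * r)
    b = M_E[rows, cols]
    s_vec = np.linalg.lstsq(A, b, rcond=None)[0]
    return s_vec.reshape(r, r)

def discrepancy(X, Y, M_E, mask):
    """Return (F(X, Y), minimizing S) following Eqs. (4)-(5)."""
    S = optimal_S(X, Y, M_E, mask)
    residual = np.where(mask, M_E - X @ S @ Y.T, 0.0)
    return 0.5 * np.sum(residual ** 2), S
```

Replacing `X` by `X @ Q` for an orthogonal `Q` leaves the returned value unchanged (only `S` rotates), which is the column-space invariance noted in the text.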

Finally, the implementation proposed here implicitly assumes that the rank r is known. In 
practice this is a non-issue. Since $r \ll n$, a loop over the value of $r$ can be added at little extra cost.
For instance, in collaborative filtering applications, r ranges between 10 and 30. 

1.3 Main results 

Notice that computing $\mathsf{T}_r(\widetilde{M}^E)$ only requires finding the first $r$ singular vectors of a sparse matrix.
Our main result establishes that this simple procedure achieves arbitrarily small relative root mean
square error from $O(nr)$ revealed entries. We define the relative root mean square error as

$$\mathrm{RMSE} \equiv \left[ \frac{1}{mn M_{\max}^2} \, \| M - \mathsf{T}_r(\widetilde{M}^E) \|_F^2 \right]^{1/2} , \tag{6}$$

where we denote by $\|A\|_F$ the Frobenius norm of matrix $A$. Notice that the factor $(1/mn)$ corresponds
to the usual normalization by the number of entries and the factor $(1/M_{\max}^2)$ corresponds to the
maximum size of the matrix entries, where $M$ satisfies $|M_{ij}| \le M_{\max}$ for all $i$ and $j$.
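The definition of the relative root mean square error translates into a few lines of code. This is a small sketch assuming NumPy; `relative_rmse` is an illustrative name, and defaulting `M_max` to the largest observed magnitude of `M` is an assumption made here for convenience.

```python
import numpy as np

def relative_rmse(M, M_hat, M_max=None):
    """Relative RMSE: ||M - M_hat||_F / (sqrt(m n) * M_max)."""
    m, n = M.shape
    if M_max is None:
        # Assumption: use the largest entry magnitude when no bound is supplied.
        M_max = np.max(np.abs(M))
    return np.linalg.norm(M - M_hat, "fro") / (np.sqrt(m * n) * M_max)
```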

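Theorem 1.1. Assume $M$ to be a rank $r$ matrix of dimension $n\alpha \times n$ that satisfies $|M_{ij}| \le M_{\max}$
for all $i,j$. Then with probability larger than $1 - 1/n^3$

$$\frac{1}{mn M_{\max}^2} \, \| M - \mathsf{T}_r(\widetilde{M}^E) \|_F^2 \le C \, \frac{\alpha^{3/2} r n}{|E|} \tag{7}$$

for some numerical constant $C$.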



This theorem is proved in Section 3.

Notice that the top $r$ singular values and singular vectors of the sparse matrix $\widetilde{M}^E$ can be
computed efficiently by subspace iteration [Ber92]. Each iteration requires $O(|E| r)$ operations. As
proved in Section 3, the $(r+1)$-th singular value is smaller than one half of the $r$-th one. As a
consequence, subspace iteration converges exponentially. A simple calculation shows that $O(\log n)$
iterations are sufficient to ensure the error bound mentioned.
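Subspace iteration can be sketched as below. This is a generic textbook variant assuming NumPy and a dense array, not necessarily the scheme of [Ber92]; `top_singular_triplets` and its parameters are illustrative names. With a sparse matrix, the two products per iteration are what cost $O(|E| r)$.

```python
import numpy as np

def top_singular_triplets(A, r, num_iters=50, seed=0):
    """Approximate the top-r singular triplets of A by subspace iteration."""
    rng = np.random.default_rng(seed)
    V = np.linalg.qr(rng.standard_normal((A.shape[1], r)))[0]
    for _ in range(num_iters):
        U = np.linalg.qr(A @ V)[0]     # refresh the left subspace
        V = np.linalg.qr(A.T @ U)[0]   # refresh the right subspace
    # Rayleigh-Ritz step: a small r x r SVD inside the two subspaces.
    Ub, s, Vbt = np.linalg.svd(U.T @ A @ V)
    return U @ Ub, s, V @ Vbt.T
```

The error contracts at a rate governed by the ratio between the $(r+1)$-th and $r$-th singular values, which is why a gap of at least a factor two gives exponential convergence.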

The 'cleaning' step in the above pseudocode improves systematically over $\mathsf{T}_r(\widetilde{M}^E)$ and, for large
enough $|E|$, reconstructs $M$ exactly.



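Theorem 1.2. Assume $M$ to be a rank $r$ matrix that satisfies the incoherence conditions A1 and
A2 with $(\mu_0, \mu_1)$. Let $\mu = \max\{\mu_0, \mu_1\}$. Further, assume $\Sigma_{\min} \le \Sigma_1, \dots, \Sigma_r \le \Sigma_{\max}$ with $\Sigma_{\min}$, $\Sigma_{\max}$
bounded away from $0$ and $\infty$. Then there exists a numerical constant $C'$ such that, if

$$|E| \ge C' n r \sqrt{\alpha} \left( \frac{\Sigma_{\max}}{\Sigma_{\min}} \right)^{2} \max\left\{ \mu_0 \log n \, , \; \mu^2 r \sqrt{\alpha} \left( \frac{\Sigma_{\max}}{\Sigma_{\min}} \right)^{4} \right\} , \tag{8}$$

then the cleaning procedure in Spectral Matrix Completion converges, with high probability, to
the matrix $M$.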

This theorem is proved in Section 6. The basic intuition is that, for $|E| \ge C'(\alpha)\, n r \max\{\log n, r\}$,
$\mathsf{T}_r(\widetilde{M}^E)$ is so close to $M$ that the cost function is well approximated by a quadratic function.

Theorem 1.1 is optimal: the number of degrees of freedom in $M$ is of order $nr$, and without the same
number of observations it is impossible to fix them. The extra $\log n$ factor in Theorem 1.2 is due to a
coupon-collector effect [CR08, KMO08, KOM09]: it is necessary that $E$ contains at least one entry
per row and one per column, and this happens only for $|E| \ge C n \log n$. As a consequence, for rank $r$
bounded, Theorem 1.2 is optimal. It is suboptimal by a polylogarithmic factor for $r = O(\log n)$.



1.4 Related work 

Beyond collaborative filtering, low rank models are used for clustering, information retrieval, machine 
learning, and image processing. In [Faz02], the NP-hard problem of finding a matrix of minimum
rank satisfying a set of affine constraints was addressed through convex relaxation. This problem is
analogous to the problem of finding the sparsest vector satisfying a set of affine constraints, which
is at the heart of compressed sensing [Don06, CRT06]. The connection with compressed sensing was
emphasized in [RFP07], which provided performance guarantees under appropriate conditions on the
constraints.

In the case of collaborative filtering, we are interested in finding a matrix $M$ of minimum rank
that matches the known entries $\{M_{ij} : (i,j) \in E\}$. Each known entry thus provides an affine
constraint. Candes and Recht [CR08] introduced the incoherent model for $M$. Within this model,
they proved that, if $E$ is random, the convex relaxation correctly reconstructs $M$ as long as $|E| \ge
C r n^{6/5} \log n$. On the other hand, from a purely information theoretic point of view (i.e. disregarding
algorithmic considerations), it is clear that $|E| = O(nr)$ observations should make it possible to reconstruct $M$
with arbitrary precision. Indeed this point was raised in [CR08] and proved in [KMO08], through a
counting argument.

The present paper describes an efficient algorithm that reconstructs a rank-$r$ matrix from $O(nr)$
random observations. The most complex component of our algorithm is the SVD in step 2. We were
able to treat realistic data sets with $n \approx 10^5$. This must be compared with the $O(n^4)$ complexity of
semidefinite programming [CR08].

Cai, Candes and Shen [CCS08] recently proposed a low-complexity procedure to solve the convex
program posed in [CR08]. Our spectral method is akin to a single step of this procedure, with the
important novelty of the trimming step, which significantly improves its performance. Our analysis
techniques might provide a new tool for characterizing the convex relaxation as well.

Theorem 1.1 can also be compared with a copious line of work in the theoretical computer science
literature [FKV04, AFK+01, AM07]. An important motivation in this context is the development of
fast algorithms for low-rank approximation. In particular, Achlioptas and McSherry [AM07] prove
a theorem analogous to Theorem 1.1, but holding only for $|E| \ge (8 \log n)^4 n$ (in the case of square matrices).

A short account of our results was submitted to the 2009 International Symposium on Information
Theory [KOM09]. While the present paper was under completion, Candes and Tao posted online
a preprint proving a theorem analogous to Theorem 1.2 [CT09]. Once more, their approach is substantially
different from ours.

1.5 Open problems and future directions 

It is worth pointing out some limitations of our results, and interesting research directions: 

1. Optimal RMSE with $O(n)$ entries. Numerical simulations with the Spectral Matrix Completion
algorithm suggest that the RMSE decays much faster with the number of observations per
degree of freedom $(|E|/nr)$ than indicated by Eq. (7). This improved behavior is a consequence of
the cleaning step in the algorithm. It would be important to characterize the decay of RMSE with
$(|E|/nr)$.

2. Threshold for exact completion. As pointed out, Theorem 1.2 is order optimal for $r$ bounded.
It would nevertheless be useful to derive quantitatively sharp estimates in this regime. A systematic
numerical study was initiated in [KMO08]. It appears that available theoretical estimates (including
the recent ones in [CT09]) are for larger values of the rank; we expect that our arguments can be
strengthened to prove exact reconstruction for $|E| \ge C'(\alpha)\, n r \log n$ for all values of $r$.

3. More general models. The model studied here and introduced in [CR08] presents obvious
limitations. In applications to collaborative filtering, the subset of observed entries $E$ is far from
uniformly random. A recent paper [SC09] investigates the uniqueness of the solution of the matrix
completion problem for general sets $E$. In applications to fast low-rank approximation, it would be
desirable to consider non-incoherent matrices as well (as in [AM07]).
2 Incoherence property and some notations



In order to formalize the notion of incoherence, we write $U = [u_1, u_2, \dots, u_r]$ and $V = [v_1, v_2, \dots, v_r]$
for the columns of the two factors, with $\|u_i\| = \sqrt{m}$, $\|v_i\| = \sqrt{n}$ and $u_i^T u_j = 0$, $v_i^T v_j = 0$ for $i \ne j$
(there is no loss of generality in this, since normalizations can be absorbed by redefining $\Sigma$). We
shall further write $\Sigma = \mathrm{diag}(\Sigma_1, \dots, \Sigma_r)$ with $\Sigma_1 \ge \Sigma_2 \ge \cdots \ge \Sigma_r > 0$.

The matrices $U$, $V$ and $\Sigma$ will be said to be $(\mu_0, \mu_1)$-incoherent if they satisfy the following
properties:

A1. For all $i \in [m]$, $j \in [n]$, we have $\sum_{k=1}^{r} U_{ik}^2 \le \mu_0 r$, $\sum_{k=1}^{r} V_{jk}^2 \le \mu_0 r$.
A2. For all $i \in [m]$, $j \in [n]$, we have $\left| \sum_{k=1}^{r} U_{ik} (\Sigma_k/\Sigma_1) V_{jk} \right| \le \mu_1 r^{1/2}$.

Apart from a difference in normalization, these assumptions coincide with the ones in [CR08].
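For given factors, the smallest constants satisfying A1 and A2 can be computed directly. The sketch below assumes NumPy and the normalization $\|u_i\| = \sqrt{m}$, $\|v_i\| = \sqrt{n}$; `incoherence_constants` is an illustrative name, and $\Sigma_1$ is taken to be the largest entry of `sigma`.

```python
import numpy as np

def incoherence_constants(U, V, sigma):
    """Smallest (mu0, mu1) for which conditions A1 and A2 hold."""
    r = U.shape[1]
    # A1: squared row norms of U and V are at most mu0 * r.
    mu0 = max(np.max(np.sum(U ** 2, axis=1)), np.max(np.sum(V ** 2, axis=1))) / r
    # A2: entries of U diag(sigma / sigma_1) V^T are at most mu1 * sqrt(r).
    weights = np.asarray(sigma, dtype=float) / np.max(sigma)
    mu1 = np.max(np.abs((U * weights) @ V.T)) / np.sqrt(r)
    return mu0, mu1
```

As a sanity check, a rank-one matrix with $u = v = (1, \dots, 1)^T$ gives $\mu_0 = \mu_1 = 1$, while concentrating $u$ on a single coordinate gives $\mu_0 = m$.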

Notice that the second incoherence assumption A2 implies the bounded entry condition in Theorem 1.1
with $M_{\max} = \mu_1 r^{1/2}$. In the following, whenever we write that a property $A$ holds with high
probability (w.h.p.), we mean that there exists a function $f(n) = f(n; \alpha)$ such that $\mathbb{P}(A) \ge 1 - f(n)$
and $f(n) \to 0$. In the case of exact completion (i.e. in the proof of Theorem 1.2), $f(\cdot)$ can also
depend on $\mu_0$, $\mu_1$, $\Sigma_{\min}$, $\Sigma_{\max}$, and $f(n) \to 0$ for $\mu_0$, $\mu_1$, $\Sigma_{\min}$, $\Sigma_{\max}$ bounded away from $0$ and $\infty$.

Probability is taken with respect to the uniformly random subset $E \subseteq [m] \times [n]$. Define $\epsilon \equiv
|E|/\sqrt{mn}$. In the case when $m = n$, $\epsilon$ corresponds to the average number of revealed entries per row
or column. Then, it is convenient to work with a model in which each entry is revealed independently
with probability $\epsilon/\sqrt{mn}$. Since, with high probability, $|E| \in [\epsilon\sqrt{\alpha}\, n - A\sqrt{n \log n},\ \epsilon\sqrt{\alpha}\, n + A\sqrt{n \log n}]$,
any guarantee on the algorithm's performance that holds within one model holds within the other
model as well if we allow for a vanishing shift in $\epsilon$.

Notice that we can assume $m \ge n$, since we can always apply our theorem to the transpose of
the matrix $M$. Throughout this paper, therefore, we will assume $\alpha \ge 1$. Finally, we will use $C$, $C'$
etc. to denote numerical constants.

Given a vector $x \in \mathbb{R}^n$, $\|x\|$ will denote its Euclidean norm. For a matrix $X \in \mathbb{R}^{n \times n'}$, $\|X\|_F$ is its
Frobenius norm, and $\|X\|_2$ its operator norm (i.e. $\|X\|_2 = \sup_{u \ne 0} \|Xu\|/\|u\|$). The standard scalar
product between vectors or matrices will sometimes be indicated by $\langle x, y \rangle$ or $\langle X, Y \rangle$, respectively.
Finally, we use the standard combinatorics notation $[N] = \{1, 2, \dots, N\}$ to denote the set of the first $N$
integers.



3 Proof of Theorem 1.1 and technical results

As explained in the previous section, the crucial idea is to consider the singular value decomposition
of the trimmed matrix $\widetilde{M}^E$ instead of the original matrix $M^E$, as in Eq. (3). We shall then redefine
$\{\sigma_i\}$, $\{x_i\}$, $\{y_i\}$, by letting

$$\widetilde{M}^E = \sum_{i=1}^{\min(m,n)} \sigma_i x_i y_i^T . \tag{9}$$

Here $\|x_i\| = \|y_i\| = 1$, $x_i^T x_j = y_i^T y_j = 0$ for $i \ne j$ and $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$. Our key technical result is
that, apart from a trivial rescaling, these singular values are close to the ones of the full matrix $M$.



Lemma 3.1. There exists a numerical constant $C > 0$ such that, with probability larger than $1 - 1/n^3$,

$$\left| \frac{\sigma_q}{\epsilon} - \Sigma_q \right| \le C M_{\max} \sqrt{\frac{\alpha}{\epsilon}} , \tag{10}$$

where it is understood that $\Sigma_q = 0$ for $q > r$.



This result generalizes a celebrated bound on the second eigenvalue of random graphs [FKS89,
FO05] and is illustrated in Fig. 1: the spectrum of $\widetilde{M}^E$ clearly reveals the rank-3 structure of $M$.
As shown in Section 5, Lemma 3.1 is a direct consequence of the following estimate.

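Lemma 3.2. There exists a numerical constant $C > 0$ such that, with probability larger than $1 - 1/n^3$,

$$\left\| \frac{\epsilon}{\sqrt{mn}} M - \widetilde{M}^E \right\|_2 \le C M_{\max} \sqrt{\alpha \epsilon} . \tag{11}$$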



The proof of this lemma is given in Section 4.
We will now prove Theorem 1.1.



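Proof. (Theorem 1.1) By triangle inequality

$$\| M - \mathsf{T}_r(\widetilde{M}^E) \|_2 \le \left\| \frac{\sqrt{mn}}{\epsilon} \widetilde{M}^E - \mathsf{T}_r(\widetilde{M}^E) \right\|_2 + \left\| M - \frac{\sqrt{mn}}{\epsilon} \widetilde{M}^E \right\|_2$$

$$\le \frac{\sqrt{mn}\, \sigma_{r+1}}{\epsilon} + C M_{\max} \sqrt{\frac{\alpha mn}{\epsilon}} \le 2 C M_{\max} \sqrt{\frac{\alpha mn}{\epsilon}} ,$$

where we used Lemma 3.2 for the second inequality and Lemma 3.1 for the last inequality. Now, for
any matrix $A$ of rank at most $2r$, $\|A\|_F \le \sqrt{2r}\, \|A\|_2$, whence

$$\frac{1}{\sqrt{mn}} \, \| M - \mathsf{T}_r(\widetilde{M}^E) \|_F \le \sqrt{\frac{2r}{mn}} \, \| M - \mathsf{T}_r(\widetilde{M}^E) \|_2 \le C' M_{\max} \sqrt{\frac{\alpha r}{\epsilon}} .$$

The result follows by using $|E| = \epsilon \sqrt{mn}$. □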

4 Proof of Lemma 3.2



We want to show that $\left| x^T \left( \widetilde{M}^E - \frac{\epsilon}{\sqrt{mn}} M \right) y \right| \le C M_{\max} \sqrt{\alpha \epsilon}$ for each $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$ such that
$\|x\| = \|y\| = 1$. Our basic strategy (inspired by [FKS89]) will be the following:

(1) Reduce to $x$, $y$ belonging to discrete sets $T_m$, $T_n$;

(2) Bound the contribution of light couples by applying union bound to these discretized sets, with
a large deviation estimate on the random variable $Z$, defined as $Z \equiv \sum_{(i,j) \in L} x_i \widetilde{M}^E_{ij} y_j - \frac{\epsilon}{\sqrt{mn}} x^T M y$;

(3) Bound the contribution of heavy couples using a bound on the discrepancy of the corresponding graph.

The technical challenge is that a worst-case bound on the tail probability of $Z$ is not good enough,
and we must keep track of its dependence on $x$ and $y$. The definition of light and heavy couples is
provided in the following section.

4.1 Discretization 

We define

$$T_n \equiv \left\{ x \in \left\{ \frac{\Delta}{\sqrt{n}} \mathbb{Z} \right\}^n : \|x\| \le 1 \right\} .$$

Notice that $T_n \subseteq S_n \equiv \{ x \in \mathbb{R}^n : \|x\| \le 1 \}$. The next remark is proved in [FKS89, FO05], and relates
the original problem to the discretized one.

Remark 4.1. Let $R \in \mathbb{R}^{m \times n}$ be a matrix. If $|x^T R y| \le B$ for all $x \in T_m$ and $y \in T_n$, then
$|x'^T R y'| \le (1 - \Delta)^{-2} B$ for all $x' \in S_m$ and $y' \in S_n$.



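Hence it is enough to show that, with high probability, $\left| x^T \left( \widetilde{M}^E - \frac{\epsilon}{\sqrt{mn}} M \right) y \right| \le C M_{\max} \sqrt{\alpha \epsilon}$ for
all $x \in T_m$ and $y \in T_n$.

A naive approach would be to apply concentration inequalities directly to the random variable
$x^T \left( \widetilde{M}^E - \frac{\epsilon}{\sqrt{mn}} M \right) y$. This fails because the vectors $x$, $y$ can contain entries that are much larger than
the typical size $O(n^{-1/2})$. We thus separate two contributions. The first contribution is due to light
couples $L \subseteq [m] \times [n]$, defined as

$$L = \left\{ (i,j) : |x_i M_{ij} y_j| \le M_{\max} \left( \frac{\epsilon}{mn} \right)^{1/2} \right\} .$$

The second contribution is due to its complement $\overline{L}$, which we call heavy couples. We have

$$\left| x^T \left( \widetilde{M}^E - \frac{\epsilon}{\sqrt{mn}} M \right) y \right| \le \left| \sum_{(i,j) \in L} x_i \widetilde{M}^E_{ij} y_j - \frac{\epsilon}{\sqrt{mn}} x^T M y \right| + \left| \sum_{(i,j) \in \overline{L}} x_i \widetilde{M}^E_{ij} y_j \right| . \tag{12}$$

In the next two subsections, we will prove that both contributions are upper bounded by $C M_{\max} \sqrt{\alpha \epsilon}$
for all $x \in T_m$, $y \in T_n$. Applying Remark 4.1 to $\left| x^T \left( \widetilde{M}^E - \frac{\epsilon}{\sqrt{mn}} M \right) y \right|$, this proves the thesis.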



4.2 Bounding the contribution of light couples 

Let us define the subsets of row and column indices which have not been trimmed as $A_l$ and $A_r$:

$$A_l = \left\{ i \in [m] : \deg(i) \le \frac{2\epsilon}{\sqrt{\alpha}} \right\} , \qquad A_r = \left\{ j \in [n] : \deg(j) \le 2 \epsilon \sqrt{\alpha} \right\} ,$$

where $\deg(\cdot)$ denotes the degree (number of revealed entries) of a row or a column. Notice that
$\mathbf{A} = (A_l, A_r)$ is a function of the random set $E$. It is easy to get a rough estimate of the sizes of $A_l$, $A_r$.

Remark 4.2. There exist $C_1$ and $C_2$ depending only on $\alpha$ such that, with probability larger than
$1 - 1/n^4$, $|A_l| \ge m - \max\{ e^{-C_1 \epsilon} m, C_2 \alpha \}$, and $|A_r| \ge n - \max\{ e^{-C_1 \epsilon} n, C_2 \}$.

For the proof of this claim, we refer to Appendix A. For any $E \subseteq [m] \times [n]$ and $A = (A_l, A_r)$
with $A_l \subseteq [m]$, $A_r \subseteq [n]$, we define $M^{E,A}$ by setting to zero the entries of $M$ that are not in $E$, those
whose row index is not in $A_l$, and those whose column index is not in $A_r$. Consider the event



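$$\mathcal{H}(E, A) = \left\{ \exists\, x, y : \left| \sum_{(i,j) \in L} x_i M^{E,A}_{ij} y_j - \frac{\epsilon}{\sqrt{mn}} x^T M y \right| > C M_{\max} \sqrt{\epsilon} \right\} , \tag{13}$$

where it is understood that $x$ and $y$ belong, respectively, to $T_m$ and $T_n$. Note that $\widetilde{M}^E = M^{E,\mathbf{A}}$,
and hence we want to bound $\mathbb{P}\{\mathcal{H}(E, \mathbf{A})\}$. We proceed as follows:

$$\mathbb{P}\{\mathcal{H}(E, \mathbf{A})\} = \sum_{A} \mathbb{P}\{\mathcal{H}(E, A), \mathbf{A} = A\}$$

$$\le \sum_{\substack{|A_l| \ge m(1-\delta), \\ |A_r| \ge n(1-\delta)}} \mathbb{P}\{\mathcal{H}(E, A), \mathbf{A} = A\} + \frac{1}{n^4}$$

$$\le 2^{(n+m) H(\delta)} \max_{\substack{|A_l| \ge m(1-\delta), \\ |A_r| \ge n(1-\delta)}} \mathbb{P}\{\mathcal{H}(E; A)\} + \frac{1}{n^4} , \tag{14}$$

with $\delta \equiv \max\{ e^{-C_1 \epsilon}, C_2 \alpha \}$ and $H(x)$ the binary entropy function.

We are now left with the task of bounding $\mathbb{P}\{\mathcal{H}(E; A)\}$ uniformly over $A$, where $\mathcal{H}$ is defined as
in Eq. (13). The key step consists in proving the following tail estimate.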



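Lemma 4.3. Let $x \in S_m$, $y \in S_n$, $Z = \sum_{(i,j) \in L} x_i M^{E,A}_{ij} y_j - \frac{\epsilon}{\sqrt{mn}} x^T M y$, and assume $|A_l| \ge m(1-\delta)$,
$|A_r| \ge n(1-\delta)$ with $\delta$ small enough. Then

$$\mathbb{P}\left( Z > L M_{\max} \sqrt{\epsilon} \right) \le \exp\left\{ - \frac{\sqrt{\alpha}\,(L - 3)}{2}\, n \right\} .$$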



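Proof. We begin by bounding the mean of $Z$ as follows (for the proof of this statement we refer to
Appendix B).

Remark 4.4. $|\mathbb{E}[Z]| \le 2 M_{\max} \sqrt{\epsilon}$.

For $A = (A_l, A_r)$, let $M^A$ be the matrix obtained from $M$ by setting to zero those entries whose
row index is not in $A_l$, and those whose column index is not in $A_r$. Define the potential contribution
of the light couples $a_{ij}$ and independent random variables $Z_{ij}$ as

$$a_{ij} = \begin{cases} x_i M^A_{ij} y_j & \text{if } |x_i M^A_{ij} y_j| \le M_{\max} (\epsilon/mn)^{1/2} , \\ 0 & \text{otherwise,} \end{cases}
\qquad
Z_{ij} = \begin{cases} a_{ij} & \text{w.p. } \epsilon/\sqrt{mn} , \\ 0 & \text{w.p. } 1 - \epsilon/\sqrt{mn} . \end{cases}$$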


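Let $Z_1 = \sum_{i,j} Z_{ij}$ so that $Z = Z_1 - \frac{\epsilon}{\sqrt{mn}} x^T M y$. Note that $\sum_{i,j} a_{ij}^2 \le \sum_{i,j} (x_i M_{ij} y_j)^2 \le M_{\max}^2$. Fix
$\lambda = \sqrt{mn}/(2 M_{\max} \sqrt{\epsilon})$ so that $|\lambda a_{ij}| \le 1/2$, whence $e^{\lambda a_{ij}} - 1 \le \lambda a_{ij} + 2 (\lambda a_{ij})^2$. It then follows that

$$\mathbb{E}[e^{\lambda Z}] \le \exp\left\{ \frac{\epsilon}{\sqrt{mn}} \left( \sum_{i,j} \lambda a_{ij} + 2 \sum_{i,j} (\lambda a_{ij})^2 \right) - \frac{\lambda \epsilon}{\sqrt{mn}} x^T M y \right\}
\le \exp\left\{ \lambda \mathbb{E}[Z] + \frac{\sqrt{mn}}{2} \right\} .$$

The thesis follows by Chernoff bound $\mathbb{P}(Z > a) \le e^{-\lambda a}\, \mathbb{E}[e^{\lambda Z}]$ after simple calculus. □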
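The 'simple calculus' can be spelled out as follows. This is a worked check rather than part of the original proof; it assumes the choice $\lambda = \sqrt{mn}/(2 M_{\max}\sqrt{\epsilon})$, the moment bound $\mathbb{E}[e^{\lambda Z}] \le \exp\{\lambda \mathbb{E}[Z] + \sqrt{mn}/2\}$, Remark 4.4, and $m = \alpha n$.

```latex
\begin{align*}
\lambda a
  &= \frac{\sqrt{mn}}{2 M_{\max}\sqrt{\epsilon}} \cdot L M_{\max}\sqrt{\epsilon}
   = \frac{L}{2}\sqrt{mn}
  && (a = L M_{\max}\sqrt{\epsilon}), \\
\lambda\,\mathbb{E}[Z]
  &\le \frac{\sqrt{mn}}{2 M_{\max}\sqrt{\epsilon}} \cdot 2 M_{\max}\sqrt{\epsilon}
   = \sqrt{mn}
  && \text{(Remark 4.4)}, \\
\mathbb{P}\bigl(Z > L M_{\max}\sqrt{\epsilon}\bigr)
  &\le \exp\Bigl\{ -\tfrac{L}{2}\sqrt{mn} + \sqrt{mn} + \tfrac{1}{2}\sqrt{mn} \Bigr\}
   = \exp\Bigl\{ -\tfrac{L-3}{2}\sqrt{mn} \Bigr\}
   = \exp\Bigl\{ -\tfrac{\sqrt{\alpha}\,(L-3)}{2}\, n \Bigr\}.
\end{align*}
```

The last equality uses $\sqrt{mn} = \sqrt{\alpha}\, n$, which recovers the exponent stated in Lemma 4.3.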

Note that $\mathbb{P}(-Z > L M_{\max} \sqrt{\epsilon})$ can also be bounded analogously. We can now finish the upper
bound on the light couples contribution. Consider the error event in Eq. (13). A simple volume
calculation shows that $|T_m| \le (10/\Delta)^m$. We can apply union bound over $T_m$ and $T_n$ to Eq. (14) to
obtain

$$\mathbb{P}\{\mathcal{H}(E, \mathbf{A})\} \le \exp\left\{ \log 2 + (1 + \alpha) \left( H(\delta) \log 2 + \log(20/\Delta) \right) n - \frac{(C - 3) \sqrt{\alpha}\, n}{2} \right\} + \frac{1}{n^4} .$$

Hence, assuming $\alpha \ge 1$, there exists a numerical constant $C'$ such that, for $C > C' \sqrt{\alpha}$, the first term
is of order $e^{-\Theta(n)}$ and this finishes the proof.

4.3 Bounding the contribution of heavy couples 

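Let $Q$ be an $m \times n$ matrix with $Q_{ij} = 1$ if $(i,j) \in E$ and $i \in A_l$, $j \in A_r$ (i.e. entry $(i,j)$ is not
trimmed by our algorithm), and $Q_{ij} = 0$ otherwise. Since $|M_{ij}| \le M_{\max}$, the heavy couples satisfy
$|x_i y_j| \ge \sqrt{\epsilon/mn}$. We then have

$$\left| \sum_{(i,j) \in \overline{L}} x_i \widetilde{M}^E_{ij} y_j \right| \le M_{\max} \sum_{(i,j) \in \overline{L}} Q_{ij} |x_i y_j| \le M_{\max} \sum_{(i,j) :\, |x_i y_j| \ge \sqrt{\epsilon/mn}} Q_{ij} |x_i y_j| .$$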

Notice that $Q$ is the adjacency matrix of a random bipartite graph with vertex sets $[m]$ and $[n]$
and maximum degree bounded by $2 \epsilon \max(\alpha^{1/2}, \alpha^{-1/2})$. The following remark strengthens a result of
[FO05].

Remark 4.5. Given vectors $x$, $y$, let $\overline{L}' = \{ (i,j) : |x_i y_j| \ge C \sqrt{\epsilon/mn} \}$. Then there exists a constant
$C'$ such that $\sum_{(i,j) \in \overline{L}'} Q_{ij} |x_i y_j| \le C' \left( \sqrt{\alpha} + \frac{1}{\sqrt{\alpha}} \right) \sqrt{\epsilon}$, for all $x \in T_m$, $y \in T_n$, with probability larger
than $1 - 1/2n^3$.

For the reader's convenience, a proof of this fact is proposed in Appendix C. The analogous result
in [FO05] (for the adjacency matrix of a non-bipartite graph) is proved to hold only with probability
larger than $1 - e^{-C\epsilon}$. The stronger statement quoted here can be proved using concentration of
measure inequalities. The last remark implies that for all $x \in T_m$, $y \in T_n$, and $\alpha \ge 1$, the contribution
of heavy couples is bounded by $C M_{\max} \sqrt{\alpha \epsilon}$ for some numerical constant $C$ with probability larger
than $1 - 1/2n^3$.

5 Proof of Lemma 3.1

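Recall the variational principle for the singular values:

$$\sigma_q = \min_{H,\ \dim(H) = n - q + 1} \ \max_{y \in H,\ \|y\| = 1} \| \widetilde{M}^E y \| \tag{15}$$

$$\phantom{\sigma_q} = \max_{H,\ \dim(H) = q} \ \min_{y \in H,\ \|y\| = 1} \| \widetilde{M}^E y \| . \tag{16}$$

Here $H$ is understood to be a linear subspace of $\mathbb{R}^n$.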
Using Eq. (15) with $H$ the orthogonal complement of $\mathrm{span}(v_1, \dots, v_{q-1})$, we have, by Lemma 3.2,

$$\sigma_q \le \max_{y \in H,\ \|y\| = 1} \| \widetilde{M}^E y \|
\le \frac{\epsilon}{\sqrt{mn}} \left( \max_{y \in H,\ \|y\| = 1} \| M y \| \right) + \max_{y \in H,\ \|y\| = \|x\| = 1} x^T \left( \widetilde{M}^E - \frac{\epsilon}{\sqrt{mn}} M \right) y
\le \epsilon \Sigma_q + C M_{\max} \sqrt{\alpha \epsilon} .$$

The lower bound is proved analogously, by using Eq. (16) with $H = \mathrm{span}(v_1, \dots, v_q)$.

6 Minimization on Grassmann manifolds and proof of Theorem 1.2

The function $F(X,Y)$ defined in Eq. (4) and to be minimized in the last part of the algorithm
can naturally be viewed as defined on Grassmann manifolds. Here we recall from [EAS99] a few
important facts on the geometry of Grassmann manifolds and related optimization algorithms. We
then prove Theorem 1.2. Technical calculations are deferred to Sections 7, 8 and to the appendices.

We recall that, for the proof of Theorem 1.2, it is assumed that $\Sigma_{\min}$, $\Sigma_{\max}$ are bounded away
from $0$ and $\infty$. Numerical constants are denoted by $C$, $C'$ etc. Finally, throughout this section, we
use the notation $X^{(i)} \in \mathbb{R}^r$ to refer to the $i$-th row of the matrix $X \in \mathbb{R}^{m \times r}$ or $X \in \mathbb{R}^{n \times r}$.



6.1 Geometry of the Grassmann manifold 

Denote by $O(d)$ the orthogonal group of $d \times d$ matrices. The Grassmann manifold is defined as the
quotient $G(n,r) \simeq O(n)/(O(r) \times O(n-r))$. In other words, a point in the manifold is the equivalence
class of an $n \times r$ orthogonal matrix $A$

$$[A] = \{ AQ : Q \in O(r) \} . \tag{17}$$

For consistency with the rest of the paper, we will assume the normalization $A^T A = n\mathbf{1}$. To represent
a point in $G(n,r)$, we will use an explicit representative of this form. More abstractly, $G(n,r)$ is the
manifold of $r$-dimensional subspaces of $\mathbb{R}^n$.

It is easy to see that $F(X,Y)$ depends on the matrices $X$, $Y$ only through their equivalence
classes $[X]$, $[Y]$. We will therefore interpret it as a function defined on the manifold $\mathsf{M}(m,n) \equiv
G(m,r) \times G(n,r)$:

$$F : \mathsf{M}(m,n) \to \mathbb{R} , \tag{18}$$

$$([X],[Y]) \mapsto F(X,Y) . \tag{19}$$

In the following, a point in this manifold will be represented as a pair $\mathbf{x} = (X,Y)$, with $X$ an $m \times r$
orthogonal matrix and $Y$ an $n \times r$ orthogonal matrix. Boldface symbols will be reserved for elements
of $\mathsf{M}(m,n)$ or of its tangent space, and we shall use $\mathbf{u} = (U,V)$ for the point corresponding to the
matrix $M = U \Sigma V^T$ to be reconstructed.

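Given $\mathbf{x} = (X,Y) \in \mathsf{M}(m,n)$, the tangent space at $\mathbf{x}$ is denoted by $T_{\mathbf{x}}$ and can be identified with
the vector space of matrix pairs $\mathbf{w} = (W,Z)$, $W \in \mathbb{R}^{m \times r}$, $Z \in \mathbb{R}^{n \times r}$ such that $W^T X = Z^T Y = 0$.
The 'canonical' Riemann metric on the Grassmann manifold corresponds to the usual scalar product
$\langle W, W' \rangle \equiv \mathrm{Tr}(W^T W')$. The induced scalar product on $T_{\mathbf{x}}$ between $\mathbf{w} = (W,Z)$ and $\mathbf{w}' = (W',Z')$
is $\langle \mathbf{w}, \mathbf{w}' \rangle = \langle W, W' \rangle + \langle Z, Z' \rangle$.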

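This metric induces a canonical notion of distance on $\mathsf{M}(m,n)$ which we denote by $d(\mathbf{x}_1, \mathbf{x}_2)$
(geodesic or arc-length distance). If $\mathbf{x}_1 = (X_1, Y_1)$ and $\mathbf{x}_2 = (X_2, Y_2)$ then

$$d(\mathbf{x}_1, \mathbf{x}_2) \equiv \sqrt{ d(X_1, X_2)^2 + d(Y_1, Y_2)^2 } , \tag{20}$$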
where the arc-length distances $d(X_1, X_2)$, $d(Y_1, Y_2)$ on the Grassmann manifold can be defined explicitly
as follows. Let $\cos\theta = (\cos\theta_1, \dots, \cos\theta_r)$, $\theta_i \in [-\pi/2, \pi/2]$, be the singular values of $X_1^T X_2/m$.
Then

$$d(X_1, X_2) = \|\theta\|_2 . \tag{21}$$

The $\theta_i$'s are called the 'principal angles' between the subspaces spanned by the columns of $X_1$ and
$X_2$. It is useful to introduce two equivalent notions of distance:

$$d_c(X_1, X_2) = \frac{1}{\sqrt{n}} \min_{Q_1, Q_2 \in O(r)} \| X_1 Q_1 - X_2 Q_2 \|_F \quad \text{(chordal distance)} , \tag{22}$$

$$d_p(X_1, X_2) = \frac{1}{\sqrt{2}\, n} \| X_1 X_1^T - X_2 X_2^T \|_F \quad \text{(projection distance)} . \tag{23}$$

Notice that $d_c$ and $d_p$ do not depend on the specific representatives $X_1$, $X_2$, but only on the equivalence
classes $[X_1]$ and $[X_2]$. Distances on $\mathsf{M}(m,n)$ are defined through the Pythagorean theorem, e.g.
$d_c(\mathbf{x}_1, \mathbf{x}_2) = \sqrt{ d_c(X_1, X_2)^2 + d_c(Y_1, Y_2)^2 }$.

Remark 6.1. The geodesic, chordal and projection distance are equivalent, namely

$$\frac{1}{\pi} d(X_1, X_2) \le \frac{1}{\sqrt{2}} d_c(X_1, X_2) \le d_p(X_1, X_2) \le d_c(X_1, X_2) \le d(X_1, X_2) . \tag{24}$$

For the reader's convenience, a proof of this fact is proposed in Appendix D.
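The three distances can be computed from the principal angles, which makes the chain of inequalities easy to check numerically. This is a sketch assuming NumPy; `principal_angles` and `grassmann_distances` are illustrative names. It works with orthonormal representatives obtained by QR, for which the chordal and projection distances reduce to $2 (\sum_i \sin^2(\theta_i/2))^{1/2}$ and $(\sum_i \sin^2\theta_i)^{1/2}$.

```python
import numpy as np

def principal_angles(X1, X2):
    """Principal angles between the column spaces of X1 and X2."""
    Q1 = np.linalg.qr(X1)[0]
    Q2 = np.linalg.qr(X2)[0]
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    # Clip guards against cosines slightly above 1 from rounding.
    return np.arccos(np.clip(cosines, -1.0, 1.0))

def grassmann_distances(X1, X2):
    """Return (geodesic, chordal, projection) distances."""
    theta = principal_angles(X1, X2)
    geodesic = np.linalg.norm(theta)
    chordal = 2.0 * np.linalg.norm(np.sin(theta / 2.0))
    projection = np.linalg.norm(np.sin(theta))
    return geodesic, chordal, projection
```

All three quantities depend only on the column spaces, so multiplying either argument on the right by an invertible $r \times r$ matrix leaves them unchanged up to rounding.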
An important remark is that geodesics with respect to the canonical Riemann metric admit an
explicit and efficiently computable form. Given $\mathbf{u} \in \mathsf{M}(m,n)$, $\mathbf{w} \in T_{\mathbf{u}}$, the corresponding geodesic
is a curve $t \mapsto \mathbf{x}(t)$, with $\mathbf{x}(t) = \mathbf{u} + \mathbf{w} t + O(t^2)$, which minimizes arc-length. If $\mathbf{u} = (U,V)$ and
$\mathbf{w} = (W,Z)$ then $\mathbf{x}(t) = (X(t), Y(t))$, where $X(t)$ can be expressed in terms of the singular value
decomposition $W = L \Theta R^T$ [EAS99]:

$$X(t) = U R \cos(\Theta t) R^T + L \sin(\Theta t) R^T , \tag{25}$$

which can be evaluated in time of order $O(nr)$. An analogous expression holds for $Y(t)$.
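The geodesic formula is straightforward to implement. The sketch below assumes NumPy and, for simplicity, an orthonormal representative ($U^T U = \mathbf{1}$ rather than the paper's rescaled normalization) with a tangent vector satisfying $U^T W = 0$; `geodesic` is an illustrative name.

```python
import numpy as np

def geodesic(U, W, t):
    """Point at time t on the Grassmann geodesic from U with velocity W.

    Assumes U has orthonormal columns and U^T W = 0.
    """
    # Thin SVD of the tangent vector: W = L diag(theta) R^T.
    L, theta, Rt = np.linalg.svd(W, full_matrices=False)
    R = Rt.T
    return (U @ R * np.cos(theta * t)) @ Rt + (L * np.sin(theta * t)) @ Rt
```

Under these assumptions the columns of the result stay orthonormal for every $t$, so no re-orthogonalization is needed along the curve.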

6.2 Gradient and incoherence 

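The gradient of $F$ at $\mathbf{x}$ is the vector $\mathrm{grad}\, F(\mathbf{x}) \in T_{\mathbf{x}}$ such that, for any smooth curve $t \mapsto \mathbf{x}(t) \in
\mathsf{M}(m,n)$ with $\mathbf{x}(t) = \mathbf{x} + \mathbf{w} t + O(t^2)$, one has

$$F(\mathbf{x}(t)) = F(\mathbf{x}) + \langle \mathrm{grad}\, F(\mathbf{x}), \mathbf{w} \rangle\, t + O(t^2) . \tag{26}$$

In order to write an explicit representation of the gradient of our cost function $F$, it is convenient to
introduce the projector operator

$$\mathcal{P}_E(M)_{ij} = \begin{cases} M_{ij} & \text{if } (i,j) \in E , \\ 0 & \text{otherwise.} \end{cases} \tag{27}$$

The two components of the gradient are then

$$\mathrm{grad}\, F(\mathbf{x})_X = \mathcal{P}_E(X S Y^T - M)\, Y S^T - X Q_X , \tag{28}$$

$$\mathrm{grad}\, F(\mathbf{x})_Y = \mathcal{P}_E(X S Y^T - M)^T X S - Y Q_Y , \tag{29}$$

where $Q_X, Q_Y \in \mathbb{R}^{r \times r}$ are determined by the condition $\mathrm{grad}\, F(\mathbf{x}) \in T_{\mathbf{x}}$. This yields

$$Q_X = -\frac{1}{m} X^T \mathcal{P}_E(M - X S Y^T)\, Y S^T , \tag{30}$$

$$Q_Y = -\frac{1}{n} Y^T \mathcal{P}_E(M - X S Y^T)^T X S . \tag{31}$$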
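The gradient components can be evaluated as follows. This is a sketch assuming NumPy, the normalization $X^T X = m\mathbf{1}$, $Y^T Y = n\mathbf{1}$, and a boolean `mask` standing for $E$; `gradient` is an illustrative name, and `S` is passed in explicitly (in the cleaning step it would be the minimizer in the definition of $F$).

```python
import numpy as np

def gradient(X, Y, S, M_E, mask):
    """Components of grad F at (X, Y) for a given S, projected on the tangent space."""
    m, n = X.shape[0], Y.shape[0]
    # P_E(X S Y^T - M): residual restricted to the revealed entries.
    residual = np.where(mask, X @ S @ Y.T - M_E, 0.0)
    G_X = residual @ Y @ S.T
    G_Y = residual.T @ X @ S
    # Q_X, Q_Y are fixed by the tangency conditions X^T grad_X = 0, Y^T grad_Y = 0.
    Q_X = X.T @ G_X / m
    Q_Y = Y.T @ G_Y / n
    return G_X - X @ Q_X, G_Y - Y @ Q_Y
```

By construction `X.T @ grad_X` and `Y.T @ grad_Y` vanish up to rounding, and both components are zero when $X S Y^T$ matches $M$ on every revealed entry.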
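6.3 Algorithm

At this point the gradient descent algorithm is fully specified. It takes as input the factors of $\mathsf{T}_r(\widetilde{M}^E)$,
to be denoted as $\mathbf{x}_0 = (X_0, Y_0)$, and minimizes a regularized cost function

$$\widetilde{F}(X,Y) = F(X,Y) + \rho\, G(X,Y) \tag{32}$$

$$\equiv F(X,Y) + \rho \sum_{i=1}^{m} G_1\!\left( \frac{\|X^{(i)}\|^2}{3 \mu_0 r} \right) + \rho \sum_{j=1}^{n} G_1\!\left( \frac{\|Y^{(j)}\|^2}{3 \mu_0 r} \right) , \tag{33}$$

where $X^{(i)}$ denotes the $i$-th row of $X$, and $Y^{(j)}$ the $j$-th row of $Y$. The role of the regularization is
to force $\mathbf{x}$ to remain incoherent during the execution of the algorithm.

$$G_1(z) = \begin{cases} 0 & \text{if } z \le 1 , \\ e^{(z-1)^2} - 1 & \text{if } z \ge 1 . \end{cases} \tag{34}$$

We will take $\rho = n \epsilon$. Notice that $G(X,Y)$ is again naturally defined on the Grassmann manifold,
i.e. $G(X,Y) = G(XQ, YQ')$ for any $Q, Q' \in O(r)$.
Let

$$\mathcal{K}(\mu') \equiv \left\{ (X,Y) \text{ such that } \|X^{(i)}\|^2 \le \mu' r ,\ \|Y^{(j)}\|^2 \le \mu' r \text{ for all } i \in [m],\ j \in [n] \right\} . \tag{35}$$

We have $G(X,Y) = 0$ on $\mathcal{K}(3\mu_0)$. Notice that $\mathbf{u} \in \mathcal{K}(\mu_0)$ by the incoherence property. Also, by the
following remark proved in Appendix D, we can assume that $\mathbf{x}_0 \in \mathcal{K}(3\mu_0)$.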

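Remark 6.2. Let $U, X \in \mathbb{R}^{n \times r}$ with $U^T U = X^T X = n\mathbf{1}$ and $U \in \mathcal{K}(\mu_0)$ and $d(X,U) \le \delta \le \frac{1}{16}$.
Then there exists $X'' \in \mathbb{R}^{n \times r}$ such that $X''^T X'' = n\mathbf{1}$, $X'' \in \mathcal{K}(3\mu_0)$ and $d(X'',U) \le 4\delta$. Further,
such an $X''$ can be computed in a time of $O(nr^2)$.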

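Gradient descent( matrix $M^E$, factors $\mathbf{x}_0$ )
For $k = 0, 1, \dots$ do:
Compute $\mathbf{w}_k = \mathrm{grad}\, \widetilde{F}(\mathbf{x}_k)$;
Let $t \mapsto \mathbf{x}_k(t)$ be the geodesic with $\mathbf{x}_k(t) = \mathbf{x}_k + \mathbf{w}_k t + O(t^2)$;
Minimize $t \mapsto \widetilde{F}(\mathbf{x}_k(t))$ for $t \ge 0$, subject to $d(\mathbf{x}_k(t), \mathbf{x}_0) \le \gamma$;
Set $\mathbf{x}_{k+1} = \mathbf{x}_k(t_k)$ where $t_k$ is the minimum location;
End For.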
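The loop above can be sketched in simplified form. This is not the procedure analyzed in the paper: the sketch assumes NumPy, uses orthonormal representatives, minimizes the unregularized $F$ (no $G$ term and no constraint $d(\mathbf{x}_k(t), \mathbf{x}_0) \le \gamma$), moves along the geodesic in the direction of the negative gradient, and replaces exact line minimization by Armijo backtracking. All function names are illustrative.

```python
import numpy as np

def fit_S(X, Y, M_E, mask):
    """Least-squares S for fixed X, Y on the revealed entries."""
    rows, cols = np.nonzero(mask)
    r = X.shape[1]
    A = (X[rows][:, :, None] * Y[cols][:, None, :]).reshape(len(rows), r * r)
    return np.linalg.lstsq(A, M_E[rows, cols], rcond=None)[0].reshape(r, r)

def cost(X, Y, M_E, mask):
    """Return (F, S, residual on revealed entries)."""
    S = fit_S(X, Y, M_E, mask)
    res = np.where(mask, X @ S @ Y.T - M_E, 0.0)
    return 0.5 * np.sum(res ** 2), S, res

def move(U, W, t):
    """Geodesic step from U with velocity W (U orthonormal, U^T W = 0)."""
    L, theta, Rt = np.linalg.svd(W, full_matrices=False)
    return ((U @ Rt.T) * np.cos(theta * t)) @ Rt + (L * np.sin(theta * t)) @ Rt

def gradient_descent(M_E, mask, X, Y, num_steps=200):
    f, S, res = cost(X, Y, M_E, mask)
    for _ in range(num_steps):
        G_X, G_Y = res @ Y @ S.T, res.T @ X @ S
        # Descent directions: negative gradients projected on the tangent spaces.
        W_X = -(G_X - X @ (X.T @ G_X))
        W_Y = -(G_Y - Y @ (Y.T @ G_Y))
        slope = np.sum(W_X ** 2) + np.sum(W_Y ** 2)
        if slope < 1e-20:
            break
        t = 1.0
        while t > 1e-12:  # Armijo backtracking along the geodesic
            X_new, Y_new = move(X, W_X, t), move(Y, W_Y, t)
            f_new, S_new, res_new = cost(X_new, Y_new, M_E, mask)
            if f_new <= f - 1e-4 * t * slope:
                break
            t *= 0.5
        else:
            break  # no acceptable step found
        X, Y, f, S, res = X_new, Y_new, f_new, S_new, res_new
    return X, Y, S, f
```

A natural starting point is the pair of top-$r$ singular vector matrices of the trimmed, revealed matrix; the returned value of $F$ is never larger than the initial one because every accepted step satisfies the Armijo decrease condition.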



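In the above, $\gamma$ must be set in such a way that $d(\mathbf{u}, \mathbf{x}_0) \le \gamma$. The next remark determines the
correct scale.

Remark 6.3. Let $U, X \in \mathbb{R}^{m \times r}$ with $U^T U = X^T X = m\mathbf{1}$, $V, Y \in \mathbb{R}^{n \times r}$ with $V^T V = Y^T Y = n\mathbf{1}$,
and $M = U \Sigma V^T$, $\widehat{M} = X S Y^T$ for $\Sigma = \mathrm{diag}(\Sigma_1, \dots, \Sigma_r)$ and $S \in \mathbb{R}^{r \times r}$. If $\Sigma_1, \dots, \Sigma_r \ge \Sigma_{\min}$, then

$$d_p(U,X) \le \frac{1}{\sqrt{\alpha}\, n \Sigma_{\min}} \| M - \widehat{M} \|_F , \qquad d_p(V,Y) \le \frac{1}{\sqrt{\alpha}\, n \Sigma_{\min}} \| M - \widehat{M} \|_F . \tag{36}$$

As a consequence of this remark and Theorem 1.1, we can assume that $d(\mathbf{u}, \mathbf{x}_0) \le C \frac{M_{\max}}{\Sigma_{\min}} \left( \frac{\alpha r}{\epsilon} \right)^{1/2}$.
We shall then set $\gamma = C' \frac{M_{\max}}{\Sigma_{\min}} \left( \frac{\alpha r}{\epsilon} \right)^{1/2}$ (the value of $C'$ is set in the course of the proof).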

Before passing to the proof of Theorem 1.2, it is worth discussing a few important points concerning the gradient descent algorithm.

(i) The appropriate choice of $\gamma$ might seem to pose a difficulty. In reality, this parameter is
introduced only to simplify the proof. We will see that the constraint $d(\mathbf{x}_k(t), \mathbf{x}_0) \le \gamma$ is, with
high probability, never saturated.

(ii) Indeed, the line minimization instruction (which might appear complex to implement) can
be replaced by a standard step selection procedure, such as the one in [Arm66].

(iii) Similarly, there is no need to know the actual value of $\mu_0$ in the regularization term. One can
start with $\mu_0 = 1$ and then repeat the optimization doubling it at each step.

(iv) The Hessian of $F$ can be computed explicitly as well. This opens the way to quadratically
convergent minimization algorithms (e.g. the Newton method).
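Point (ii) can be illustrated by a generic backtracking rule of Armijo type. The sketch below is a textbook sufficient-decrease search along a one-dimensional curve, not necessarily the exact procedure of [Arm66]; the function name, the default constants and the toy cost are assumptions.

```python
def armijo_step(phi, slope0, t0=1.0, shrink=0.5, c=1e-4, max_halvings=50):
    """Backtracking search: first t = t0 * shrink^j with phi(t) <= phi(0) + c*t*slope0.

    phi(t) is the cost along the search curve and slope0 = phi'(0) < 0.
    """
    phi0 = phi(0.0)
    t = t0
    for _ in range(max_halvings):
        if phi(t) <= phi0 + c * t * slope0:
            return t
        t *= shrink
    return 0.0   # no acceptable step found; the caller keeps the current iterate

# Toy curve: phi(t) = (t - 0.3)^2, so phi'(0) = -0.6.
phi = lambda t: (t - 0.3) ** 2
step = armijo_step(phi, slope0=-0.6)
```

In the setting of the algorithm, `phi` would be $t \mapsto \widetilde F(\mathbf{x}_k(t))$ along the geodesic, so each trial step costs one evaluation of the regularized cost.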

6.4 Proof of Theorem 1.2

The proof of Theorem 1.2 breaks down in two lemmas. The first one implies that, in a sufficiently
small neighborhood of $\mathbf{u}$, the function $\mathbf{x} \mapsto F(\mathbf{x})$ is well approximated by a parabola.

Lemma 6.4. There exists numerical constants Co,Ci,C2 such that the following happens. Assume 
e > Co|J-o^/ar ma,x{logn; f^or^/a{T.^l,^/T.mm)^} and 5 < S^j^/CoSmax- Then 

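$$C_1\sqrt{\alpha}\,\Sigma_{\min}^2\, d(\mathbf{x},\mathbf{u})^2 + C_1\sqrt{\alpha}\,\|S - \Sigma\|_F^2 \;\le\; \frac{1}{n\epsilon}\,F(\mathbf{x}) \;\le\; C_2\sqrt{\alpha}\,\Sigma_{\max}^2\, d(\mathbf{x},\mathbf{u})^2 \tag{37}$$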

for all X E M(m, n) n/C(4/xo) such that (i(x, u) < 5, with probability at least 1 — Here S S R''^^ 

is the matrix realizing the minimum in Eq. 

The second Lemma implies that $\mathbf{x} \mapsto \widetilde F(\mathbf{x})$ does not have any other stationary point (apart from
$\mathbf{u}$) within such a neighborhood.

Lemma 6.5. There exists numerical constants Co,C such that the following happens. Assume 
e > Co/io''\/a(Smax/Smm)^niax{logn;/xor^(i;max/Smin)'^} and 6 < Smin/CoSmax- Then 

$$\|\mathrm{grad}\,\widetilde F(\mathbf{x})\|^2 \ge C\, n\epsilon^2\, \Sigma_{\min}^4\, d(\mathbf{x},\mathbf{u})^2$$
for all X G M(m, n) n /C(4^o) such that (i(x, u) < 6, with probability at least 1 — 1/n^. 
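We can now prove Theorem 1.2.

Proof. (Theorem 1.2) Let $\delta > 0$ be such that Lemma 6.4 and Lemma 6.5 are verified, and $C_1, C_2$ be
defined as in Lemma 6.4. We further assume $\delta \le \sqrt{(e^{1/9} - 1)/C_2}$. Take $\epsilon$ large enough such that
$d(\mathbf{u},\mathbf{x}_0) \le \min\big(1, (C_1/C_2)^{1/2}(\Sigma_{\min}/\Sigma_{\max})\big)\,\delta/10$. Further, set the algorithm parameter to $\gamma = \delta/4$.
We make the following claims:

1. $\mathbf{x}_k \in \mathcal{K}(4\mu_0)$ for all $k$.

Indeed $\mathbf{x}_0 \in \mathcal{K}(3\mu_0)$ whence $\widetilde F(\mathbf{x}_0) = F(\mathbf{x}_0) \le C_2\sqrt{\alpha}\, n\epsilon\, \Sigma_{\max}^2 \delta^2$. The claim follows because
$\widetilde F(\mathbf{x}_k)$ is non-increasing and $\widetilde F(\mathbf{x}) \ge \rho\, G(X,Y) \ge n\epsilon\sqrt{\alpha}\,\Sigma_{\max}^2 (e^{1/9} - 1)$ for $\mathbf{x} \notin \mathcal{K}(4\mu_0)$, where
we choose $\rho$ to be $n\epsilon\sqrt{\alpha}\,\Sigma_{\max}^2$.

2. $d(\mathbf{x}_k,\mathbf{u}) \le \delta/10$ for all $k$.

Since we set $\gamma = \delta/4$, by triangular inequality, we can assume to have $d(\mathbf{x}_k,\mathbf{u}) \le \delta/2$. Since
$d(\mathbf{x}_0,\mathbf{u})^2 \le (C_1\Sigma_{\min}^2/C_2\Sigma_{\max}^2)(\delta/10)^2$ we have $\widetilde F(\mathbf{x}) \ge F(\mathbf{x}) \ge F(\mathbf{x}_0)$ for all $\mathbf{x}$ such that
$d(\mathbf{x},\mathbf{u}) \in [\delta/10, \delta]$. Since $\widetilde F(\mathbf{x}_k)$ is non-increasing and $\widetilde F(\mathbf{x}_0) = F(\mathbf{x}_0)$, the claim follows.

Notice that, by the last observation, the constraint $d(\mathbf{x}_k(t),\mathbf{x}_0) \le \gamma$ is never saturated, and
therefore our procedure is just gradient descent with exact line search. Therefore by [Arm66] this
must converge to the unique stationary point of $\widetilde F$ in $\mathcal{K}(4\mu_0) \cap \{\mathbf{x} : d(\mathbf{x},\mathbf{u}) \le \delta/10\}$, which, by
Lemma 6.5, is $\mathbf{u}$. □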



7 Proof of Lemma 6.4



7.1 A random graph Lemma 

The following Lemma will be used several times in the following. 

Lemma 7.1. There exist two numerical constants Ci,C2 suet that the following happens. If e > 
Ci logn then, with probability larger than 1 — 

Xj^j < ||x||i||y||i + C2\/ae||2;||2 II2/II2 • (38) 

for all X € R™, y € R". 

Proof. Write $x_i = x_0 + x_i'$ where $\sum_i x_i' = 0$. Then

$$\sum_{(i,j)\in E} x_i y_j = x_0 \sum_{j\in[n]} \deg(j)\, y_j + \sum_{(i,j)\in E} x_i' y_j, \tag{39}$$

where we recall that $\deg(j) = |\{i \in [m] : (i,j) \in E\}|$. Further $|x_0| = |\sum_i x_i/m| \le \|x\|_1/m$.
The first term is upper bounded by

$$|x_0| \max_{j\in[n]} \deg(j)\, \|y\|_1 \le \max_{j\in[n]} \deg(j)\, \|x\|_1\|y\|_1/m. \tag{40}$$

For e > Cilogn, with probability larger than 1 — l/2n^, the maximum degree is bounded by 
(9/Ci)-y/ae which is of same order as the average degree. Therefore this term is at most C2y/ae\ |x| |i| |y| |i 

The second term is upper bounded by C2-v/ae||x'||2||y||2 using Theorem 1.1 in [FQ05| or, equiv- 
alently, Theorem 13.11 in the case r = 1 and Mmax = 1- It can be shown to hold with probabil- 
ity larger than 1 — l/2n^ with a large enough numerical constant C2. The thesis follows because 
Ik'lb < \\x\\2- □ 

7.2 Preliminary facts and estimates 

This subsection contains some remarks that will be useful in the proof of Lemma 6.5 as well.

Let $\mathbf{w} = (W,Z) \in T_{\mathbf{u}}$, and $t \mapsto (X(t),Y(t))$ be the geodesic such that $(X(t),Y(t)) = (U,V) + (W,Z)t + O(t^2)$. By setting $(X,Y) = (X(1),Y(1))$, we establish a one-to-one correspondence between
the points $\mathbf{x}$ as in the statement and a neighborhood of the origin in $T_{\mathbf{u}}$. If we let $W = L\Theta R^T$ be
the singular value decomposition of $W$ (with $L^TL = m\mathbf{1}$ and $R^TR = \mathbf{1}$), the explicit expression for
geodesics in Eq. (25) yields

$$X = U + \bar W, \qquad \bar W = UR(\cos\Theta - 1)R^T + L\sin\Theta\, R^T. \tag{41}$$

An analogous expression can obviously be written for $Y = V + \bar Z$. Notice that, by the equivalence
between chordal and canonical distance, Remark 6.1, we have

$$\frac{1}{m}\|\bar W\|_F^2 + \frac{1}{n}\|\bar Z\|_F^2 \le d(\mathbf{u},\mathbf{x})^2. \tag{42}$$
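The geodesic formula can be exercised numerically. The sketch below assumes the parametrization $X(t) = UR\cos(\Theta t)R^T + L\sin(\Theta t)R^T$ with $W = L\Theta R^T$ and $L^TL = m\mathbf{1}$, which reduces to Eq. (41) at $t = 1$; the function name and the random test data are hypothetical.

```python
import numpy as np

def geodesic_point(U, W, t=1.0):
    """Point X(t) on the geodesic through U (U^T U = m*1) with velocity W (U^T W = 0)."""
    m = U.shape[0]
    # W = L Theta R^T with L^T L = m*1, so rescale the usual thin SVD.
    Lhat, s, Rt = np.linalg.svd(W, full_matrices=False)
    L, theta, R = np.sqrt(m) * Lhat, s / np.sqrt(m), Rt.T
    return (U @ R @ np.diag(np.cos(theta * t)) @ R.T
            + L @ np.diag(np.sin(theta * t)) @ R.T)

rng = np.random.default_rng(1)
m, r = 10, 3
U = np.sqrt(m) * np.linalg.qr(rng.standard_normal((m, r)))[0]
W = rng.standard_normal((m, r))
W -= U @ (U.T @ W) / m            # project onto the tangent space: U^T W = 0
W *= 0.2                          # keep the principal angles small
X = geodesic_point(U, W)
Wbar = X - U                      # the matrix called W-bar in Eq. (41)
theta = np.linalg.svd(W, compute_uv=False) / np.sqrt(m)
```

On this example one can verify that $X^TX = m\mathbf{1}$ and that $\frac{1}{m}\|\bar W\|_F^2 = \|2\sin(\theta/2)\|^2 \le \|\theta\|^2$, which is the single-factor version of the chordal versus canonical comparison.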
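Remark 7.2. If $\mathbf{u} \in \mathcal{K}(\mu_0)$ and $\mathbf{x} \in \mathcal{K}(4\mu_0)$, then $(\bar W, \bar Z) \in \mathcal{K}(10\mu_0)$ and $\mathbf{w} = (W,Z) \in \mathcal{K}(5\pi^2\mu_0/2)$.

Proof. The first fact follows from $\|\bar W^{(i)}\|^2 \le 2\|X^{(i)}\|^2 + 2\|U^{(i)}\|^2$. In order to prove $\mathbf{w} \in \mathcal{K}(5\pi^2\mu_0/2)$,
we notice that

$$\|W^{(i)}\|^2 = \|\Theta L^{(i)}\|^2 \le \frac{\pi^2}{4}\|\sin\Theta\, L^{(i)}\|^2 \le \frac{\pi^2}{4}\|X^{(i)} - R\cos\Theta\, R^TU^{(i)}\|^2 \le \frac{\pi^2}{2}\big(\|X^{(i)}\|^2 + \|U^{(i)}\|^2\big).$$

The claim follows by showing a similar bound for $\|Z^{(j)}\|^2$. □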

We next prove a simple a priori estimate. 

Remark 7.3. There exist numerical constants Ci,C2 such that the following holds with probability 
larger than 1 — If e > Ci logn, then for any {X,Y) G /C(/i) and S G , 

Y: {XSY^)% < CWSWlV^r^e f + '-\\Y\\l) f + + ^) . (43) 

(.j^^ \m n J \m n J 

Proof. Using Lemma [TT] j)eE^-X^^'^)'ij upper bounded by 



a,b {i,j)&E 



< C2\\S\\lV^ne(-\\X\\j, + -||y|||.)' + C2\\S\\lai^rnV~e(-\\X\\l + , 



\m n / \m n 

where in the second step we used the incoherence condition. The last step follows from the inequalities 
2ab < a{a/a + 6)^ and 2ab < ^{a^ja + 6^). □ 

7.3 The proof 

Proof. (Lemma l6.4p Denote by S G R''^'" the matrix realizing the minimum in Eq. We will start 
by proving a lower bound on -F(x) of the form 

— F(x) > CiV^S^i„(i(x,u)2 + CiV^||5-S|||.-C7;V^Sma,d(x,u)2||S-S||^, (44) 
ne 

and an upper bound as in Eq. ([37]). Together, for (i(x, u) < < 1, these imply US — < 
CS^3^x^^(x5 u)^, whence the lower bound in Eq. ([37|) follows for 5 < Smin/CoSmax- 
In order to prove the bound (f44l) we write X = [/ + VF, y = y + Z, and 



F(x,y) = ^ XI (f^C-s - s)y^ + c/5z^ + lysy^ + iy5z'^)2^- 



2 

> 1^2 _ 1^2 

- 4 2 
where we used the inequality $(1/2)(a+b)^2 \ge (a^2/4) - (b^2/2)$ and defined

Using Remark \7.2>\ and Eq. ([i2|l we get 

< Cy/^ne \\S\\l (^d(x, u)^ + d(x, u)^ 

where the second inequality follows from the inequality crmax(5')^ < ^S^^^^ + 2 US' — 

By Theorem 4.1 in |CR08j . we have > {1/2)E{A'^} with probability larger than 1 - for 
e > C/io\/a^loS^- Further 

E{>i2} = -^\\u(s -j:)v^ + usz^ + wsv^\\'i 

^Jmn 

- ' ._\\u{s-i:)v^\\l + ^\\usz^\\l + ^\\wsv^\\l 



'mn \/mn \/mn 

2e 



y/mn y/mn ^mn 

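Let us call the absolute value of the six terms on the right hand side $E_1, \dots, E_6$. A simple calculation
yields

$$E_1 = n\epsilon\sqrt{\alpha}\,\|S - \Sigma\|_F^2, \tag{45}$$
$$E_2 + E_3 \ge n\epsilon\sqrt{\alpha}\,\sigma_{\min}(S)^2\Big(\frac{1}{m}\|\bar W\|_F^2 + \frac{1}{n}\|\bar Z\|_F^2\Big) \ge C\,\sigma_{\min}(S)^2\, n\epsilon\sqrt{\alpha}\, d(\mathbf{x},\mathbf{u})^2. \tag{46}$$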

The absolute value of the fourth term can be written as 

Ei = -^\{USZ^,WSV^)\<^ar^US?\\W^U\\F\\V^MF 



n\ a n\ a 



< ^a^.^isfi^ww^uwl + wv^'zwl)- 



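In order to proceed, consider Eq. (41). Since by the tangency condition $U^TL = 0$, we have $U^T\bar W = mR(\cos\Theta - 1)R^T$ whence

$$\|U^T\bar W\|_F = m\,\|\cos\theta - 1\| = \frac{m}{2}\,\|4\sin^2(\theta/2)\| \le \frac{m}{2}\,\|2\sin(\theta/2)\|^2 \tag{47}$$

(here $\theta = (\theta_1, \dots, \theta_r)$ is the vector containing the diagonal elements of $\Theta$). A similar calculation
reveals that $\|\bar W\|_F^2 = m\,\|2\sin(\theta/2)\|^2$, thus proving $\|U^T\bar W\|_F^2 \le \|\bar W\|_F^4/4 \le Cm\delta^2\|\bar W\|_F^2$. The bound
$\|V^T\bar Z\|_F^2 \le Cn\delta^2\|\bar Z\|_F^2$ is proved in the same way, thus yielding

$$E_4 \le C\, n\epsilon\sqrt{\alpha}\,\sigma_{\max}(S)^2\,\delta^2\, d(\mathbf{x},\mathbf{u})^2. \tag{48}$$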
By a similar calculation



^5 = ^T,{{S-T.)S'W^U}<^ara..{{S-^)S')\\W'U\\F 



< neVao"max(5')||5 - S||j7'(i(u,x)2 . 

and analogously 

Eg < ne^/aami,x{S)\\S - S| |i7'(i(u, x)^ . 
Combining these estimates, and using > E{j4^}/2, we get 

ne 

-C2y/acr^i,x{S)'^5^d{u,x.f - C2\/acrinax(5') ||5 - mFd{u,xf 

for some numerical constants Ci, C2 > 0. Using the bounds (JramiS)^ > T,'^^^/2 — ||5 — 
Cmax('S')2 < 2T,ma.x + ^ ||»S' — and the assumption d(x, u) < for 5 < Smm/CoSmax, we get the 
claim 



We are now left with the task of proving the upper bound in Eq. (|37j) . We can set S = S, thus 
obtaining 

F{x,Y) < ^ ^ (c/sz^ + T^sy^ + 



where we defined 



Bounds for these two quantities are derived as for and B^. More precisely, by Theorem 4.1 in 
|CR08] . we have A'^ < 2E{A'^} with probability at least 1 - 1/n^ and 

E{A^} = — + 



^\wj:v^\\1 + ^\\uy;z^\\1 



< 2^ne^iJ-\\W\\l + -\\Z\\l) <4V^neSLx4x,u)2. 
\m n / 

i?2 is bounded similar to i?^ and we get, 

□ 
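8 Proof of Lemma 6.5

As in the proof of Lemma 6.4, see Section 7.2, we let $t \mapsto \mathbf{x}(t) = (X(t),Y(t))$ be the geodesic
starting at $\mathbf{x}(0) = \mathbf{u}$ with velocity $\dot{\mathbf{x}}(0) = \mathbf{w} = (W,Z) \in T_{\mathbf{u}}$. We also define $\mathbf{x} = \mathbf{x}(1) = (X,Y)$ with
$X = U + \bar W$ and $Y = V + \bar Z$. Let $\widehat{\mathbf{w}} = \dot{\mathbf{x}}(1) = (\widehat W, \widehat Z)$ be its velocity when passing through $\mathbf{x}$. An
explicit expression is obtained in terms of the singular value decomposition of $W$ and $Z$. If we let
$W = L\Theta R^T$, and differentiate Eq. (25) with respect to $t$ at $t = 1$, we obtain

$$\widehat W = -UR\,\Theta\sin\Theta\, R^T + L\,\Theta\cos\Theta\, R^T. \tag{49}$$

An analogous expression holds for $\widehat Z$. Since $L^TU = 0$, we have $\|\widehat W\|_F^2 = m\|\theta\sin\theta\|^2 + m\|\theta\cos\theta\|^2 = m\|\theta\|^2$. Hence¹

$$\frac{1}{m}\|\widehat W\|_F^2 + \frac{1}{n}\|\widehat Z\|_F^2 = d(\mathbf{x},\mathbf{u})^2. \tag{50}$$

In order to prove the thesis, it is therefore sufficient to lower bound $\langle \mathrm{grad}\,\widetilde F(\mathbf{x}), \widehat{\mathbf{w}}\rangle$. In the following
we will indeed show that

$$\langle \mathrm{grad}\,F(\mathbf{x}), \widehat{\mathbf{w}}\rangle \ge C\sqrt{\alpha}\, n\epsilon\, \Sigma_{\min}^2\, d(\mathbf{x},\mathbf{u})^2,$$

and $\langle \mathrm{grad}\,G(\mathbf{x}), \widehat{\mathbf{w}}\rangle \ge 0$, which together imply the thesis by the Cauchy-Schwarz inequality.
Let us prove a few preliminary estimates.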

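Remark 8.1. With the above definitions, $\widehat{\mathbf{w}} \in \mathcal{K}((11/2)\pi^2\mu_0)$.

Proof. Since $\Theta = \mathrm{diag}(\theta_1, \dots, \theta_r)$ with $|\theta_i| \le \pi/2$, we get

$$\|\widehat W^{(i)}\|^2 \le 2\|\Theta\sin\Theta\, R^TU^{(i)}\|^2 + 2\|\Theta\cos\Theta\, L^{(i)}\|^2 \le \frac{\pi^2}{2}\|U^{(i)}\|^2 + 2\|W^{(i)}\|^2. \tag{51}$$

By assumption we have $\|U^{(i)}\|^2 \le \mu_0 r$ and by Remark 7.2 we have $\|W^{(i)}\|^2 \le 5\pi^2\mu_0 r/2$. □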

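One important fact that we will use is that $\bar W$ is well approximated by $W$ or by $\widehat W$, and $\bar Z$ is
well approximated by $Z$ or by $\widehat Z$. Using Eqs. (41) and (49) we get

$$\|\widehat W\|_F^2 = \|W\|_F^2 = m\|\theta\|^2, \tag{52}$$
$$\|\bar W\|_F^2 = m\,\|2\sin(\theta/2)\|^2, \tag{53}$$
$$\langle \bar W, W\rangle = m\sum_{a=1}^{r}\theta_a\sin\theta_a, \tag{54}$$
$$\langle W, \widehat W\rangle = m\sum_{a=1}^{r}\theta_a^2\cos\theta_a, \tag{55}$$

and therefore

$$\|\bar W - W\|_F^2 = m\sum_{a=1}^{r}\big[(2\sin(\theta_a/2))^2 + \theta_a^2 - 2\theta_a\sin\theta_a\big] \tag{56}$$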

< mj2i0a-2sm{ej2)f<^\\e\\'<^diu,^)'. (57) 

a=l 



¹ Indeed this conclusion could have been reached immediately, since $t \mapsto \mathbf{x}(t)$ is a geodesic parametrized proportionally to the arclength in the interval $t \in [0,1]$.
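Analogously

$$\|\widehat W - W\|_F^2 = m\sum_{a=1}^{r}\big[2\theta_a^2 - 2\theta_a^2\cos\theta_a\big] \le m\|\theta\|^4 \le m\, d(\mathbf{u},\mathbf{x})^4, \tag{58}$$

where we used the inequality $2(1 - \cos x) \le x^2$. The last inequality implies in particular

$$\|U^T\widehat W\|_F = \|U^T(\widehat W - W)\|_F \le m\, d(\mathbf{u},\mathbf{x})^2. \tag{59}$$

Similar bounds hold of course for $Z, \bar Z, \widehat Z$ (for instance we have $\|V^T\widehat Z\|_F \le n\, d(\mathbf{u},\mathbf{x})^2$). Finally, we
shall use repeatedly the fact that $\|S - \Sigma\|_F^2 \le C\Sigma_{\max}^2\, d(\mathbf{x},\mathbf{u})^2$, which follows from Lemma 6.4. This
in turn implies

$$\sigma_{\max}(S) \le \Sigma_{\max} + C\,\Sigma_{\max}\, d(\mathbf{x},\mathbf{u}) \le 2\,\Sigma_{\max}, \tag{60}$$
$$\sigma_{\min}(S) \ge \Sigma_{\min} - C\,\Sigma_{\max}\, d(\mathbf{x},\mathbf{u}) \ge \frac{1}{2}\,\Sigma_{\min}, \tag{61}$$

where we used the hypothesis $d(\mathbf{x},\mathbf{u}) \le \delta = \Sigma_{\min}/C_0\Sigma_{\max}$.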
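The norm and inner-product identities among $W$, $\bar W$ and $\widehat W$ used in this section, together with the bound on $\|\widehat W - W\|_F^2$, can be checked on a random instance. The sketch assumes the explicit forms $W = L\Theta R^T$, $\bar W = UR(\cos\Theta - 1)R^T + L\sin\Theta R^T$ and $\widehat W = -UR\Theta\sin\Theta R^T + L\Theta\cos\Theta R^T$; the way $U$, $L$, $R$ and $\theta$ are generated is an arbitrary choice for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
m, r = 12, 3
# Build U, L with U^T U = L^T L = m*1 and U^T L = 0, plus R in O(r).
Q = np.linalg.qr(rng.standard_normal((m, 2 * r)))[0]
U, L = np.sqrt(m) * Q[:, :r], np.sqrt(m) * Q[:, r:]
R = np.linalg.qr(rng.standard_normal((r, r)))[0]
theta = rng.uniform(0.0, np.pi / 2, size=r)
D = np.diag

W = L @ D(theta) @ R.T                                    # velocity at u
Wbar = U @ R @ D(np.cos(theta) - 1) @ R.T + L @ D(np.sin(theta)) @ R.T
What = (-U @ R @ D(theta * np.sin(theta)) @ R.T
        + L @ D(theta * np.cos(theta)) @ R.T)             # velocity at x

fro2 = lambda A: float(np.sum(A * A))        # squared Frobenius norm
inner = lambda A, B: float(np.sum(A * B))    # trace inner product
```

The quantities `fro2(W)`, `fro2(Wbar)`, `inner(Wbar, W)` and `inner(W, What)` can then be compared with the closed forms in terms of $\theta$.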
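8.1 Lower bound on grad F(x)

Recalling that $\mathcal{P}_E$ is the projector defined in Eq. (27), and using the expressions (28), (29) for the
gradient, we have

$$\langle \mathrm{grad}\,F(\mathbf{x}), \widehat{\mathbf{w}}\rangle = \langle \mathcal{P}_E(XSY^T - M),\ (XS\widehat Z^T + \widehat WSY^T)\rangle$$
$$= \langle \mathcal{P}_E(U(S-\Sigma)V^T + US\bar Z^T + \bar WSV^T + \bar WS\bar Z^T),\ (US\widehat Z^T + \widehat WSV^T + \bar WS\widehat Z^T + \widehat WS\bar Z^T)\rangle$$
$$\ge A - B_1 - B_2 - B_3, \tag{62}$$

where we defined

$$A = \langle \mathcal{P}_E(US\bar Z^T + \bar WSV^T),\ (US\widehat Z^T + \widehat WSV^T)\rangle,$$
$$B_1 = |\langle \mathcal{P}_E(US\bar Z^T + \bar WSV^T),\ (\bar WS\widehat Z^T + \widehat WS\bar Z^T)\rangle|,$$
$$B_2 = |\langle \mathcal{P}_E(U(S-\Sigma)V^T + \bar WS\bar Z^T),\ (US\widehat Z^T + \widehat WSV^T)\rangle|,$$
$$B_3 = |\langle \mathcal{P}_E(U(S-\Sigma)V^T + \bar WS\bar Z^T),\ (\bar WS\widehat Z^T + \widehat WS\bar Z^T)\rangle|.$$

At this point the proof becomes very similar to the one in the previous section and consists in lower
bounding $A$ and upper bounding $B_1$, $B_2$, $B_3$.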

8.1.1 Lower bound on A 

Using Theorem 4.1 in |CR08j we obtain, with probability larger than 1 — 1/n^. 

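$$A \ge \frac{\epsilon}{2\sqrt{mn}}\,\big\langle (US\bar Z^T + \bar WSV^T),\ (US\widehat Z^T + \widehat WSV^T)\big\rangle \ge A_0 - B_0,$$

where

$$A_0 = \frac{\epsilon}{2\sqrt{mn}}\,\|US\bar Z^T + \bar WSV^T\|_F^2,$$
$$B_0 = \frac{\epsilon}{2\sqrt{mn}}\,\|US\bar Z^T + \bar WSV^T\|_F\,\|US(\widehat Z - \bar Z)^T + (\widehat W - \bar W)SV^T\|_F.$$

The term $A_0$ is lower bounded analogously to $\mathbb{E}\{A^2\}$ in the proof of Lemma 6.4, see Eqs. (46) and
(48):

$$A_0 = \frac{\epsilon}{2\sqrt{mn}}\,\|US\bar Z^T + \bar WSV^T\|_F^2$$
$$= \frac{\epsilon}{2\sqrt{mn}}\,\|US\bar Z^T\|_F^2 + \frac{\epsilon}{2\sqrt{mn}}\,\|\bar WSV^T\|_F^2 + \frac{\epsilon}{\sqrt{mn}}\,\langle US\bar Z^T, \bar WSV^T\rangle$$
$$\ge C\, n\epsilon\,\big(\sqrt{\alpha}\,\sigma_{\min}(S)^2 - \sqrt{\alpha}\,\delta^2\,\sigma_{\max}(S)^2\big)\, d(\mathbf{x},\mathbf{u})^2 \ge C\sqrt{\alpha}\, n\epsilon\, \Sigma_{\min}^2\, d(\mathbf{x},\mathbf{u})^2,$$

where we used the bounds (60), (61) and the hypothesis $d(\mathbf{x},\mathbf{u}) \le \delta = \Sigma_{\min}/C_0\Sigma_{\max}$.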
As for the second term we notice that 

M < neV^(-\\S{W-W)\\l + -\\S^{Z-Z)\\l] (63) 



^0 ^ 

< ne^a^ax(5)2(l||Pr-t?||| + -||Z-Z||^) <(:7neV^sLx'^(x,u)^ (64) 
\m n J 

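where, in the last step, we used the estimate (57) and the analogous one for $\|\widehat Z - \bar Z\|_F$. Therefore
for $d(\mathbf{x},\mathbf{u}) \le \delta \le \Sigma_{\min}/C_0\Sigma_{\max}$ and $C_0$ large enough $A_0 \ge 2B_0$, whence

$$A \ge C\sqrt{\alpha}\, n\epsilon\, \Sigma_{\min}^2\, d(\mathbf{x},\mathbf{u})^2. \tag{65}$$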

8.1.2 Upper bound on $B_1$

We begin by noting that $B_1$ can be bounded above by the sum of four terms of the form $B_1' = |\langle \mathcal{P}_E(US\bar Z^T), \bar WS\widehat Z^T\rangle|$. We show that $B_1' \le A/100$. The other terms are bounded similarly.
Using Remark 7.3 we have



\\VEiWSZ')\\j, < C^=\\W\\U\SZ'\\i + C'^efiorV^^m..\\W\\F\\SZ\\F 

< 2C^=\\W\\l\\SZ^\\l + 2C^=\\W\\l\\SiZ -ZfWj, 

+C'^/^/iOr^/^S^ax||W||ir||5Z||i. + C'V^/iorV^S^ax|F||i.||5(Z - Z)||i. 

< C'Aib''^ 



r 



rp 

where we have used -^\\SZ |||, < 3^n < 12^ from Section IHXTl Therefore we have, 

^mn II 1 1 — ^ — ' ' ' 

B'^ < \\VEiJJSz')\\l\\VE(WSZ^)\\l 



mn \ ^/& 

< C'A^fd 



m 

The thesis follows for 5 and e as in the hypothesis. 
8.1.3 Upper bound on $B_2$

We have

$$B_2 \le |\langle \mathcal{P}_E(US\widehat Z^T + \widehat WSV^T),\ \bar WS\bar Z^T\rangle| + |\langle \mathcal{P}_E(US\widehat Z^T),\ U(S-\Sigma)V^T\rangle| + |\langle \mathcal{P}_E(\widehat WSV^T),\ U(S-\Sigma)V^T\rangle|$$
$$\equiv B_2' + B_2'' + B_2'''.$$

We claim that each of these three terms is smaller than $A/30$, whence $B_2 \le A/10$.
The upper bound on $B_2'$ is obtained similarly to the one on $B_1$ to get $B_2' \le A/30$.
Consider now $B_2''$. By Theorem 4.1 in [CR08],



^2 < -^\{uis-m^,usz^)\ + c^. r'''''''^}?^'' \\u{s-j:)v^\\F\\usz\\F 



To bound the second term, observe 

WUSZ^Wf < WUSZ^Wf + \\US{Z -^f\\F 

< \\USZ'^\\p + Smax\/"1'^ u)^ 

Also, -^\\USZ III < 3^0 < 12A from SectionEXH Combining these, we have that the second 
term in B2 is smaller than A/60 for e as in the hypothesis. 
To bound the first term in B2, 

\{u{s -^)v^,usz^)\ = \{u{s -j:){y -vf,usz^)\ 

< \\u{s - s)z^||i.||c/5z^||F + \\u{s - j:)z'^\\f\\us(z - Z\\f 

Therefore 

B2 < -^=\\U{S -J:)Z\\F\\USZ\\F + en^/^^'i^^d{x,uf + A/60 



Jmn 

< -^\\U(S -J:)Z\\f\\USZ\\f + A/40 

for d(x, u) < (5 as in the hypothesis. 

We are now left with upper bounding B2 = —^\\U{S — Yj)Z\\f\\U SZ\\f . 



B'^ < {^\\uis-j:)z^\\U(^\\usz^\\l 

'mn I \ Jmn 



< {enV^^l,^dix,u)'')(^\\USZ^\\l) 



Also from the lower bound on A, we have, ^^^ \\U S Z |||, < 3Aq < 12A. Using d{x,u) < 6, we 

have B2 < A/120 for 6 as in the hypothesis. This proves the desired result. The bound on B2' is 
calculated analogously. 
8.1.4 Upper bound on $B_3$

Finally for the last term it is sufficient to use a crude bound 

B3 < 4(117^^5(^5^^)11^ + \\VEiWSz'^)\\F) [WVEiUiS - j:)v^)\\f + \\Ve(WSZ^)\\f 

The terms of the form | |'P£;(VFS'Z^)| 1^? are all estimated as in Section [8.1.2[ Also, by Theorem 4.1 
of [CR08] 

\\Ve{U{S-^)V^)\\f < C^||C/(S-S)y^||| 

\/mn 



< CneVaS^a^d(x, u)^ 
Combining these estimates with the $\delta$ and the $\epsilon$ in the hypothesis, we get $B_3 \le A/10$.

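8.2 Lower bound on grad G(x)

By the definition of $G$ in Eq. (33), we have

$$\langle \mathrm{grad}\,G(\mathbf{x}), \widehat{\mathbf{w}}\rangle = \frac{2}{3\mu_0 r}\sum_{i=1}^{m} G_1'\!\left(\frac{\|X^{(i)}\|^2}{3\mu_0 r}\right)\langle X^{(i)}, \widehat W^{(i)}\rangle + \frac{2}{3\mu_0 r}\sum_{j=1}^{n} G_1'\!\left(\frac{\|Y^{(j)}\|^2}{3\mu_0 r}\right)\langle Y^{(j)}, \widehat Z^{(j)}\rangle. \tag{66}$$

It is therefore sufficient to show that if $\|X^{(i)}\|^2 > 3\mu_0 r$, then $\langle X^{(i)}, \widehat W^{(i)}\rangle > 0$, and if $\|Y^{(j)}\|^2 > 3\mu_0 r$, then $\langle Y^{(j)}, \widehat Z^{(j)}\rangle > 0$. We will just consider the first statement, the second being completely
symmetrical.

From the explicit expressions (41) and (49) we get

$$X^{(i)} = R\,\big\{\cos\Theta\, R^TU^{(i)} + \sin\Theta\, L^{(i)}\big\}, \tag{67}$$
$$\widehat W^{(i)} = R\,\big\{\Theta\cos\Theta\, L^{(i)} - \Theta\sin\Theta\, R^TU^{(i)}\big\}. \tag{68}$$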
From the first expression it follows that 

llsinGL^^^lP < + WcosQ R^U^'^\\^ < 5 /Uor . 

On the other hand, by taking the difference of Eqs. (j67p and (|68p we have 

< ||(sine-ecose)LW|| + ||(cosG + esine)-R^C/(^)|| 

< max{ef)\\smeL^'^\\ + ^\\U^'^\\ < 5./IiI^ + . 

i 2 2 

where we used the inequality (sinw — locoslo) < cj^sinw valid for co G [0, 7r/2]. For 6 small enough 
we have therefore \ \X'^^ - W^'^W < (99/100) ^3^^^. To conclude, for ||X«|| > 3^0^- 

> - ||X«|| -t?»|| > 11X^11(73^- (99/100) ys;^^) >0. 
A Proof of Remark 4.2



The proof is a generalization of analogous result in [FQ05j , which is proved to hold only with prob- 
ability larger than 1 — e~^''. The stronger statement quoted here can be proved using concentration 
of measure inequalities. 

First, we apply Chernoff bound to the event ||w4;| > max{e~'^^'^m, C2a}|. In the case of large 

e, when e > S-y/alogn, we have > C2a] < l/2n^, for C2 > max{e,26/a}. In the case of 

small e, when e < S-^/alogn, we have f{\Ai\ > max{e-^i'm, C2a}} < l/2n^, for Ci < 1/600^0 
and C2 > 130. Here we made a moderate assumption of e > 3y/a, which is typically in the region of 
interest. 

Analogously, we can prove that P{|^r| > maxje"*^^*^?!, C2}} < l/2n^ , which finishes the proof 
of Remark 



B Proof of Remark 4.4

The expectation of the contribution of light couples, when each edge is independently revealed with 
probability e/\/mn, is 



E[Z] 



'mn 



where we define M"^ by setting to zero the rows of M whose index is not in Ai and the columns of 
M whose index is not in A^. 

In order to bound YIl ^i-^ijVj ~ x'^My, we write. 



^ XiMfjVj-x^My 



< 



x^iM"^ - M]y 



+ 



Note that |(M^ — M)ij\ is non-zero only \i i ^ Ai or j ^ Ar, in which case |(M^ — M)ij\ < 
-^max- Also, by Remark 14.21 there exists 5 = max{e~'^i'^, C2/n} such that \i : i ^ Ai\ < 6m and 
\j ■ j ^ Ar\ < 6n. Denoting by I( • ) the indicator function, we have 



x^(M^ -M]y 



< ( v6rn^/n + y&n^pm ) Mj 



mn 
for 5<i-.We can bound the second term as follows



4e 



J2 ^i^ivj 



< 



< 



< 



E 



1 Imn X - I A 
7 \ — 1^ \^^M,.Jy, 

-'max V — 
-^max V ^ .^r 1 1 



mn 




< M„ 



where the second inequality follows from the definition of heavy couples. 
Hence, summing up the two contributions, we get 

|E[Z]| < 2M^ax^/^. 



C Proof of Remark 4.5

We can associate to the matrix Q a bipartite graph Q = ([m], The proof is similar to the one 

in |FKS891 IFOOSj and is based on two properties of the graph Q: 

1. Bounded degree. The graph Q has maximum degree bounded by a constant times the average 
degree: 

deg(i) < (69) 



deg(i) < 2eV^, (70) 

for all i £ [m] and j G [n] . 

2. Discrepancy. We say that Q (equivalently, the adjacency matrix Q) has the discrepancy prop- 
erty if, for any A C [m] and B Q [n], one of the following is true: 

' -(^'^^^-(^) ^'^'/^''^'^>^K max{|^i;^|i.|V^} ) ■ 

for two numerical constants ^i, ^2 (independent of n and e). Here e{A, B) denotes the number 
of edges between A and B and fj,{A,B) = \A\\B\\E\/mn denotes the average number of edges 
between A and B before trimming. 

We will prove, later in this section, that the discrepancy property holds with high probability. 
Let us partition row and column indices with respect to the value of Xu and y^: 

Ai = {u£ [m] : -^2'-^ < < ^2^} , 



Bj = {ve [n] : < |y,| < A2i} 
for i e {1,2, . . . , [In (v^/A)/ln2]}, and j G {1,2, . . . , [In ( V^/ A) /In 2]}, and we denote the size of
subsets Ai and Bj by Oj and bj respectively. Furthermore, we define Cij to be the number of edges 
between two subsets Ai and Bj, and we let = aibj{e/^/mn). Notice that all indices u of non 
zero Xu fall into one of the subsets Aj's defined above, since, by discretization, the smallest non-zero 
element of x G in absolute value is at least A / y/m. The same applies for the entries of y E T„ . 
By grouping the summation into j4j's and -Bj's, we get 

sr ri \ \ \^ A2^ A2J 



|a;„j/„|>-^ 

I ^ I — ^ran 



A 2 \ " I e Cj j 2* 2-' 



'mn jiij yjm yjn 



A Ve Vfli — &j "hjTi' 



Note that, by definition, we have 

J]a,<4||x||VA2, (73) 

i 

^A<4||y||VA2. (74) 

i 

We are now left with task of bounding ^ OiPjaij, for Q that satisfies bounded degree property and 
discrepancy property. 
Define, 

Ci = |(i,j) : 2*+^' > and (A^Bj) satisfies (I7I])| , (75) 

C2 = |(i,j) : 2^+^' > ^^1^ and (^„5j) satisfies (I72])| \Ci . (76) 

We need to show that Yl(^i j)^CiUC2 bounded. 

For the terms in Ci this bound is easy. Since summation is over pairs of indices (i, j) such that 
2i+j > 40/6^ follows from bounded degree property that aij < ^lA^/AC. By Eqs. ([73]) and (HI]), 
we have J2c, ^iPj<^i,j < (6 AV4C)(2/A)4 = 0(1). 

For the terms in C2 the bound is more complicated. We assume < abj for simplicity and 
the other case can be treated in the same manner. By change of notation the second discrepancy 
condition becomes 

— < 6max{ai/Va,^jVa}log 7 — ^ ^ . (77) 

We start by changing variables on both sides of Eq. ()77p . 

^ log ^ < ^2&,Valog — 



I 1 I — sz^v V " 1 a 
Now, multiply each side by T' jhj^tl^ to get

a,,,a, log < [log(22^) - log . (78) 

To achieve the desired bound, we partition the analysis into 5 cases: 

1. fJij < 1 : By Eqs. ^ and ([TID, we have < (2/^)'' = 0(1). 

2. 2* > -y/e2-' : By the bounded degree property in Eq. ([TOl) . we have e^j- < ai2e/\/a, which implies 
thateij//iij < 2n/6j. For a fixed z we have, /3iO-i,iI(2* > \/e2J ) < 2 I]^- 2^-*][(2* > Ve2^ ) < 
4. Then, Eai/?if^*,i < IG/A^ = 0(1). 

3. log (eij7/iij) > \ [log(2^J') — log/?j] : From Eq. ([78|) . it immediately follows that ajjaj < ^^j- 
Due to case 2, we can assume 2* < \ft2\ which implies that for a fixed j we have the following 
inequality : Ei < ^6 Ei "^1(2* < \/e2^ ) < 8^2- Then it follows by Eq. ^ that 
E«i/?jf^M <32e2/A2 = 0(l). 

4. log(2^'') > — log/?j : Due to case 3, we can assume log [cij / fj^ij) < j [log(2^'') — log/3j], which 
implies that log (eij//ijj) < log(2''). Further, since we are not in case 1, we can assume 
1 < CTjj- = eiJ^/e/ ^ij2^~^^ . Combining those two inequalities, we get 2* < ^/e. 

Since in defining C2 we excluded Ci, if S C2 then log (eij/fiij) > 1. Applying Eq. (f78l) we 

get aijai < a^jOilog (e,j///,j) < {C22'-^/V^) [log{2^^) - log(3j] < 4^227^^. 

Combining above two results, it follows that Ej (^ij^^i ^ 4^2 Ej "^^(2* ^ V^) ^ ^'^2 • Then, 
we have the desired bound : E'^i/^j^i.j — — 

5. log(22j) < -logPj : It follows, since we are not in case 3, that log (eij/fiij) < | [log(22j) - log (3j] < 
— logPj. Hence, eij/nij < l/Pj- This implies that aij = eiJ^/e/ fiij2'^~^^ < -y/e//3j2*+-'. Since 
the summation is over pairs of indices (i,j) such that 2*"'"-' > 4C-y/e/A^, we have Ej ^''jj/^i ^ 

Then it follows that ^ ^ = 0(1)- 

Analogous analysis for the set of indices (i,j) such that Oj > abj will give us similar bounds. 
Summing up the results, we get that there exists a constant C < + + + + such 
that 

aiPjdij < C' . 



(ij):2'+J>-^, 



This finishes the proof of Remark 14.51 

Lemma C.l. The adjacency matrix Q has discrepancy property with probability at least 1 — l/2n^. 

Proof. The proof is a generalization of analogous result in [FKS891 IFO05| which is proved to hold only 
with probability larger than 1— e~*^^. The stronger statement quoted here is a result of the observation 
that, when we trim the graph the number of edges between any two subsets does not increase. Define 
Qo to be the adjacency matrix corresponding to original random matrix before trimming. If 
the discrepancy assumption holds for Qq, then it also holds for Q, since 6*^(^4,5) < e^°{A,B), for 
A C [m] and B C [n]. 

Now we need to show that the desired property is satisfied for Qq. This is proved for the case 
of non-bipartite graph in Section 2.2.5 of [FO05j . and analogous analysis for bipartite graph shows 
that for all subsets A C [m] and B C [n], with probability at least 1 — 1/2 (mn)^, the discrepancy
condition holds with = 2e and ^2 = (3p + 12){a^^'^ + a"^/^). Since we assume a > 1, taking p to 
be 3/2 proves the desired thesis. □ 

D Proof of Remarks 6.1, 6.2 and 6.3

Proof. (Remark 6.1) Let $\theta = (\theta_1, \dots, \theta_r)$, $\theta_i \in [-\pi/2, \pi/2]$ be the principal angles between the
planes spanned by the columns of $X_1$ and $X_2$. It is known that $d_c(X_1, X_2) = \|2\sin(\theta/2)\|_2$ and
$d_p(X_1, X_2) = \|\sin\theta\|_2$. The thesis follows from the elementary inequalities

$$\frac{1}{\pi}\,\alpha \le \sqrt{2}\,\sin(\alpha/2) \le \sin\alpha \le 2\sin(\alpha/2) \tag{79}$$

valid for $\alpha \in [0, \pi/2]$. □
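A quick numerical sanity check of this chain of elementary inequalities on a grid of $[0, \pi/2]$ is sketched below, with the leftmost term taken to be $\alpha/\pi$ (an assumption of the example). The function name, grid size and tolerance are arbitrary choices.

```python
import math

def chain(alpha):
    """Return the four quantities compared in the chain, left to right."""
    return (alpha / math.pi,
            math.sqrt(2.0) * math.sin(alpha / 2.0),
            math.sin(alpha),
            2.0 * math.sin(alpha / 2.0))

# Grid of 1001 points covering [0, pi/2], endpoints included.
grid = [k * (math.pi / 2.0) / 1000 for k in range(1001)]
# Collect every grid point at which some consecutive pair is out of order.
violations = [a for a in grid
              if any(x > y + 1e-12 for x, y in zip(chain(a), chain(a)[1:]))]
```

The middle inequality is tight at $\alpha = \pi/2$, where $\sqrt{2}\sin(\pi/4) = \sin(\pi/2) = 1$, which is why a small tolerance is used in the comparison.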
Proof. (Remark 6.2) Given $X \in \mathbb{R}^{n\times r}$ define $X'$ by

$$X'^{(i)} = \frac{X^{(i)}}{\|X^{(i)}\|}\,\min\big(\|X^{(i)}\|, \sqrt{\mu_0 r}\big)$$

for all $i \in [n]$.

Let $A$ be a matrix for extracting the orthonormal basis of the columns of $X'$. That is, $A \in \mathbb{R}^{r\times r}$
such that $X'' = X'A$ and $X''^TX'' = n\mathbf{1}$. Without loss of generality, $A$ can be taken to be a
symmetric matrix. In the following, let $\sigma_i = \sigma_i(A^{-1})$ for all $i \in [r]$. Note that by construction
$d(U,X') \le d(U,X) \le \delta$. Hence there is a $Q_1 \in O(r)$ such that

$$\|U - X'Q_1\|_F^2 \le n\delta^2. \tag{80}$$

We start by writing

$$nA^{-1}A^{-1} = X'^TX' = Q_1\big(n\mathbf{1} - (U - X'Q_1)^TU - U^T(U - X'Q_1) + (U - X'Q_1)^T(U - X'Q_1)\big)Q_1^T. \tag{81}$$

Using (80), we have

$$\|(U - X'Q_1)^TU\|_F \le \|U\|_2\,\|U - X'Q_1\|_F \le n\delta,$$

and

$$\|(U - X'Q_1)^T(U - X'Q_1)\|_F \le n\delta^2.$$

Therefore, using (81),

$$\sigma_1^2 \le 1 + 2\delta + \delta^2, \tag{82}$$
$$\sigma_r^2 \ge 1 - 2\delta - \delta^2. \tag{83}$$

From (83) and $\delta \le 1/16$, we get $\sigma_1 \le \sqrt{3}$ and $\sigma_r \ge 1/\sqrt{3}$. Since $\|X''^{(i)}\|^2 = \|X'^{(i)}A\|^2 \le 3\mu_0 r$
for all $i \in [n]$, we have that $X'' \in \mathcal{K}(3\mu_0)$.

We next prove that $d(X',X'') \le 3\delta$, which implies the thesis by the triangular inequality:

$$d(X',X'')^2 = \frac{1}{n}\min_{Q\in O(r)}\|X' - X''Q\|_F^2 \le \frac{1}{n}\|X''A^{-1} - X''\|_F^2 \le \|(A^{-1} - \mathbf{1})(A^{-1} + \mathbf{1})\|_F^2 = \|A^{-1}A^{-1} - \mathbf{1}\|_F^2 \le 9\delta^2,$$

where the last inequality is from (81). □
Proof. (Remark 6.3) We start by observing that

$$d_p(V,Y) = \frac{1}{\sqrt{n}}\,\min_{A\in\mathbb{R}^{r\times r}}\|V - YA\|_F. \tag{84}$$

Indeed the minimization on the right hand side can be performed explicitly (as $\|V - YA\|_F^2$ is a
quadratic function of $A$) and the minimum is achieved at $A = Y^TV/n$. The inequality follows by
simple algebraic manipulations.
Take $A = S^TX^TU\Sigma^{-1}/m$. Then

$$\|V - YA\|_F = \sup_{B,\,\|B\|_F\le 1}\langle B, (V - YA)\rangle \tag{85}$$
$$= \sup_{B,\,\|B\|_F\le 1}\Big\langle B^T, \frac{1}{m}\Sigma^{-1}U^T(U\Sigma V^T - XSY^T)\Big\rangle \tag{86}$$
$$= \frac{1}{m}\sup_{B,\,\|B\|_F\le 1}\langle U\Sigma^{-1}B^T, (M - \widehat M)\rangle \tag{87}$$
$$\le \frac{1}{m}\sup_{B,\,\|B\|_F\le 1}\|U\Sigma^{-1}B^T\|_F\,\|M - \widehat M\|_F. \tag{88}$$

On the other hand

$$\|U\Sigma^{-1}B^T\|_F^2 = \mathrm{Tr}(B\Sigma^{-1}U^TU\Sigma^{-1}B^T) = m\,\mathrm{Tr}(B^TB\Sigma^{-2}) \le m\,\Sigma_{\min}^{-2}\|B\|_F^2,$$

where the last inequality follows from the fact that $\Sigma$ is diagonal. Together with (84) and (88), this
implies the thesis. □
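The variational characterization of the projection distance used at the start of this proof can be verified numerically. The sketch assumes $d_p(V,Y) = \|\sin\theta\|_2$ with the normalization $V^TV = Y^TY = n\mathbf{1}$, and compares it with $\frac{1}{\sqrt{n}}\min_A \|V - YA\|_F$, computed both from the closed-form minimizer $A = Y^TV/n$ and from a generic least-squares solve. Variable names and test data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 20, 3
V = np.sqrt(n) * np.linalg.qr(rng.standard_normal((n, r)))[0]   # V^T V = n*1
Y = np.sqrt(n) * np.linalg.qr(rng.standard_normal((n, r)))[0]   # Y^T Y = n*1

# Principal angles: the singular values of Y^T V / n are cos(theta_i).
cosines = np.clip(np.linalg.svd(Y.T @ V / n, compute_uv=False), 0.0, 1.0)
d_p = np.sqrt(np.sum(1.0 - cosines ** 2))        # || sin(theta) ||_2

A_star = Y.T @ V / n                              # closed-form minimizer
closed_form = np.linalg.norm(V - Y @ A_star) / np.sqrt(n)
A_lstsq = np.linalg.lstsq(Y, V, rcond=None)[0]    # generic least-squares minimizer
least_squares = np.linalg.norm(V - Y @ A_lstsq) / np.sqrt(n)
```

All three numbers `d_p`, `closed_form` and `least_squares` should coincide up to round-off, since $V - YY^TV/n$ is the component of $V$ orthogonal to the column space of $Y$.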